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Abstract 


We  present  fast  methods  for  filtering  voltage  measurements  and  performing  optimal  inference 
of  the  location  and  strength  of  synaptic  connections  in  large  dendritic  trees.  Given  noisy,  subsam¬ 
pled  voltage  observations  we  develop  fast  b-penalized  regression  methods  for  Kalman  state-space 
models  of  the  neuron  voltage  dynamics.  The  value  of  the  Zi-penalty  parameter  is  chosen  using  cross- 
validation  or,  for  low  signal-to-noise  ratio,  a  Mallows’  Cp-like  criterion.  Using  low-rank  approxima¬ 
tions,  we  reduce  the  inference  runtime  from  cubic  to  linear  in  the  number  of  dendritic  compartments. 
We  also  present  an  alternative,  fully  Bayesian  approach  to  the  inference  problem  using  a  spike-and- 
slab  prior.  We  illustrate  our  results  with  simulations  on  toy  and  real  neuronal  geometries.  We 
consider  observation  schemes  that  either  scan  the  dendritic  geometry  uniformly  or  measure  linear 
combinations  of  voltages  across  several  locations  with  random  coefficients.  For  the  latter,  we  show 
how  to  choose  the  coefficients  to  offset  the  correlation  between  successive  measurements  imposed  by 
the  neuron  dynamics.  This  results  in  a  “compressed  sensing”  observation  scheme,  with  an  important 
reduction  in  the  number  of  measurements  required  to  infer  the  synaptic  weights. 


1  Introduction 


Understanding  the  synaptic  organization  of  local  neural  circuits  remains  a  central  challenge  in  neu¬ 
roscience.  To  make  progress  towards  this  aim  it  would  be  of  great  value  to  measure  the  full  synaptic 
connectivity  on  the  dendritic  tree.  In  particular,  we  would  like  to  quantify  not  just  which  neurons 
are  connected  to  a  given  cell,  but  also  where  these  synaptic  inputs  are  on  the  postsynaptic  dendritic 
tree,  and  with  what  strength  (Fig.  1).  Such  a  technique  would  help  in  addressing  a  variety  of  open 
questions  on  the  localization  and  maintenance  of  synaptic  plasticity  (Sjostrom  et  ah,  2008),  and 
would  facilitate  the  study  of  nonlinear  dendritic  computations. 

To  achieve  this  goal,  we  can  combine  the  ability  to  stimulate  individual  presynaptic  neurons 
with  high  temporal  resolution  (either  electrically,  via  intracellular  stimulation,  or  optically  (Packer 
et  ah,  2012)  and  to  simultaneously  image  postsynaptic  neurons  at  subcellular  spatial  resolution. 
In  particular,  we  can  use  two  available,  complementary  types  of  data  to  obtain  the  best  possible 
estimates: 

1.  Anatomical  measurements  of  the  postsynaptic  neuron’s  shape  and  dendritic  arborization.  This 
provides  a  backbone  on  which  we  can  build  a  dynamical  model  of  the  postsynaptic  cell. 

2.  Voltage-sensitive  fluorescence,  observed  at  subcellular  resolution.  Modern  imaging  methods 
can  access  small  dendritic  structures  and  allow  rapid  observations  from  many  spatial  locations 
(Reddy  and  Saggau,  2005;  Iyer  et  ah,  2006;  Vucinic  and  Sejnowski,  2007;  Kralj  et  ah,  2011). 
This  provides  access  to  the  key  dynamical  variable  of  interest,  the  spatiotemporal  subthreshold 
voltage. 

Since  current  voltage  imaging  technologies  have  relatively  low  signal-to-noise  ratio  (SNR)  (Djurisic 
et  ah,  2004;  Dombeck  et  ah,  2004;  Sacconi  et  ah,  2006;  Nuriya  et  al.,  2006;  Canepari  et  ah,  2007; 
Milojkovic  et  ah,  2007;  Fisher  et  ah,  2008;  Djurisic  et  ah,  2008;  Canepari  et  al.,  2008;  Peterka  et  ah, 
2011),  we  have  to  apply  optimal  filtering  methods  to  exploit  these  measurements  fully. 

In  this  paper  we  present  fast  methods  to  optimally  filter  these  voltage  measurements  and  infer 
the  location  and  strength  of  synaptic  connections  in  the  dendritic  tree.  The  problem  is  formulated 
in  a  state-space  model  framework  and  builds  on  fast  Bayesian  methods  that  we  have  previously 
developed  (Huys  et  ah,  2006;  Huys  and  Paninski,  2009;  Paninski  and  Ferreira,  2008;  Paninski,  2010; 
Huggins  and  Paninski,  2012;  Pnevmatikakis,  Paninski,  Rad  and  Huggins,  2012),  for  performing  op¬ 
timal  inference  of  subthreshold  voltage  given  noisy  and  incomplete  observations.  A  key  contribution 
of  this  work  is  to  note  that  these  fast  filtering  methods  can  be  combined  with  fast  optimization 
methods  from  the  sparse  Bayesian  literature  (Efron  et  ah,  2004)  to  obtain  a  fast  solution  to  this 
synaptic  estimation  problem. 

We  also  present  a  fully  Bayesian  approach  to  the  inference  problem,  using  a  spike-and-slab 
prior  (Mitchell  and  Beauchamp,  1988).  Although  computationally  more  intensive,  this  approach 
provides  confidence  intervals  for  the  estimates  of  the  synaptic  weights. 

An  additional  contribution  of  this  paper  is  in  the  area  of  experimental  design.  There  has 
been  much  interest  recently  in  experimental  implementations  of  the  compressed  sensing  paradigm 
(Nikolenko  et  al.,  2008;  Studer  et  al.,  2012),  which  allows  one  to  reconstruct  a  signal  that  is  sparse 
in  some  basis  from  a  small  number  of  measurements  (Candes  and  Wakin,  2008).  In  our  case,  the 
measurements  are  performed  on  voltages  with  a  temporal  dynamics  dictated  by  the  cable  equation 
of  the  neuron.  We  show  how  to  compensate  for  this  dynamics  in  the  voltage  observations  in  such  a 
way  that  the  compressed  sensing  results  apply  to  our  case. 

The  paper  is  organized  as  follows.  Section  2  presents  the  basic  ideas  of  the  inference  and  obser¬ 
vation  methods,  with  the  mathematical  details  presented  in  the  Appendices.  Section  3  illustrates 
our  results  with  simulated  data  on  a  toy  neuron  and  on  a  real  large  dendrite  tree.  We  conclude  in 
Section  4  with  some  possible  extensions  to  our  model. 
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Figure  1:  Schematic  of  proposed  method.  By  observing  a  noisy,  subsampled  spatiotemporal  voltage  signal  on 
the  dendritic  tree,  simultaneously  with  the  presynaptic  neuron’s  spike  train,  we  can  infer  the  strength  of  a  given 
presynaptic  cell’s  inputs  at  each  location  on  the  postsynaptic  cell’s  dendritic  tree. 


2  The  dynamical  model 

The  dynamical  model 

For  concreteness,  we  begin  with  the  following  model.  We  assume  that  observations  are  available  from 
a  neuron  with  N  compartments  in  which  the  passive  cable  dynamics  and  the  observation  equations 
are 


Vt+dt  =  AVt  +  WUt  +  et,  et  ~  Af(0,  a2dtl)  t  =  0,...,T-l  (2.1) 

Vt  =  BtVt+fk,  ilt~N{0,CyI)  t  =  1, . . .  ,T .  (2.2) 

In  the  first  equation,  Vt  is  an  unobserved  IV-dimensional  vector  of  compartment  voltages  at  time  t 
that  evolves  according  to  a  discretized  cable  equation  with  timestep  dt,  perturbed  by  a  Gaussian 
noise  source  et;  A f(n,C)  denotes  a  Gaussian  density  of  mean  /z  and  covariance  C  and  /  is  the 
identity  matrix  of  the  appropiate  dimension.  Assuming  we  can  stimulate  K  presynaptic  neurons 
in  a  controlled  manner,  Ut  represents  a  Af-dimensional  vector  of  known  presynaptic  signals  (the 
presynaptic  spike  times  filtered  by  some  fixed  synaptic  current  filter).  Finally,  W  is  the  N  x  K 
matrix  of  synaptic  weights  that  we  want  to  estimate. 

We  assume  an  experimental  setting  in  which  we  simultaneously  perform  S  voltage  observations 
at  each  discrete  time  t.  (S  could  vary  with  time,  but  to  keep  the  notation  manageable  we  will 
assume  that  S  is  fixed  here.)  In  the  second  equation,  (2.2),  yt  is  an  S'— dimensional  vector  of 
observations  related  instantaneously  to  Vt  by  the  SxN  matrix  Bt  that  specifies  how  the  observations 
are  performed.  We  will  discuss  below  several  forms  for  Bt.  Cy  is  the  noise  covariance  of  the 
observations,  which  depends  on  the  imaging  apparatus  used  in  each  experiment.  We  assume  this 
covariance  is  proportional  to  the  identity  for  simplicity,  i.e.  Cy  oc  /,  but  this  condition  can  also  be 
easily  relaxed. 

The  inverse  of  the  cable  dynamics  matrix  A  £  HlNxN  is  sparse  and  symmetric.  It  encodes  the 
membrane  leak  at  each  compartment  as  well  as  coupling  currents  between  adjacent  compartments. 
The  sparseness  of  A-1  follows  from  its  “tree-tridiagonal”  form:  the  off-diagonal  elements  A”1  are 
non-zero  only  if  compartments  n\  and  u-2  are  first  neighbors  in  the  dendritic  tree  (see  Hines  (1984) 
and  Paninski  (2010)  for  details).  Note  that  we  are  taking  explicit  advantage  of  the  fact  that  the 
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anatomy  of  the  imaged  cell  is  known  (or  at  least  may  be  reconstructed  post  hoc);  thus  we  know  a 
priori  which  compartments  are  adjacent,  and  specifying  the  matrix  A  comes  down  to  setting  a  few 
resistivity  coefficients,  if  the  diameter  of  each  compartment  can  be  estimated  reliably  (see  Huys  et  al. 
(2006)  for  further  discussion.)  In  general,  we  can  potentially  recover  A  and  er2  via  an  Expectation- 
Maximization  approach  (Huys  and  Paninski,  2009),  by  observing  the  neuron’s  response  to  varied 
subthreshold  current  injections,  before  any  presynaptic  spikes  are  elicited. 

This  linear  Gaussian  model,  with  a  passive  (i.e.  voltage  independent)  dynamic  matrix  A,  can  be 
a  valid  description  for  regimes  with  low  network  bring  rate,  so  that  the  postsynaptic  dendritic  tree  is 
in  a  subthreshold  state.  Furthermore,  we  assume  that  synaptic  inputs  are  sufficiently  small  that  the 
postsynaptic  response  may  be  treated  using  linear  dynamics.  (In  an  experimental  setting  we  may 
enforce  the  subthreshold  condition  pharmacologically,  by  blocking  voltage-gated  sodium  channels  in 
the  post-synaptic  cell.) 

On  the  other  hand,  real  neural  systems  are  known  to  depart  from  this  linear,  passive  Gaussian 
regime.  The  noise  can  be  non-Gaussian  and  strongly  correlated  (due,  e.g.,  to  unobserved  spikes  in 
the  circuit),  and  the  dynamics  equation  becomes  non-linear  when  voltage  dependent  conductances 
and  driving  forces  are  taken  into  account.  Also,  for  some  measurement  techniques,  the  observation 
equation  may  depart  from  the  form  (2.2).  We  discuss  some  of  these  generalizations  in  Section  4. 

The  likelihood  function  and  the  sparsity  penalty 

We  assume  that  in  equations  (2.1)-(2.2)  the  variables  and  parameters  are  as  follows: 

•  Known  /  Observed :  A,Uua2  ,dt,  Bt,yt,Cy. 

•  Unknown/Unobserved:  Vt ,  IT. 

If  the  system  evolves  during  T  time  units,  we  can  collect  all  the  voltages  Vt  into  the  ./VT-vector  V 
and  all  the  observations  yt  into  the  ,S'T- vector  Y .  The  complete  log-likelihood  for  the  combined  V 
and  Y  variables  is  (Durbin  et  al.,  2001) 

\ogp(Y,V\W)  =  logp(Y\V)  +  \ogp(V\W) 

T  T 

=  ^2^ogp(yt\Vt)  +  '^2,\ogp{Vt\Vt-x,  W)  +k>gp(Vi) 

t= 1  t= 2 

1  T 

=  -2  T,(vt-BWI'cv1(yt-Btvt) 

z  t= l 
1  T 

~2  -  WUt^)TCy\Vt  -  AVt-i  -  WUt _r) 

Z  t= 2 

—  ^  Vi  Cq  1  Vi  +  const. , 

where  Cy  =  a2dtl,  by  eq.  (2.1).  In  equation  (2.3),  \ogp{Y,V\W)  abbreviates 
\ogp{Y,  V\W,  U,  A,  a ,  B ,  Cy).  Equation  (2.3)  uses  the  factorization  p(Y,  V\W)  =  p{Y\V)p{V\W)  and 
the  first  sum  in  (2.4)  reflects  the  independence  of  the  measurements  yt  for  each  t.  The  second  sum 
in  (2.4)  follows  from  the  fact  that  the  probability  distribution  of  Vt  only  depends  on  V)_i  and  WUt. 
Finally,  equation  (2.5)  reflects  the  Gaussian  nature  of  each  term  in  (2.4),  as  follows  from  (2.1)-(2.2). 
In  the  last  term  we  assumed  E(Vi)  =  0  (having  parameterized  V  — >  V  —  Vrest  to  simplify  the  dy¬ 
namics  equation)  and  for  the  initial  state  covariance  Co  =  cov  ( V) )  we  choose  a  convenient  initial 
stationary  condition  on  which  we  elaborate  in  Appendix  C. 

The  log- likelihood  (2.3)  cannot  be  evaluated  because  it  involves  the  unobserved  voltages  Vt,  so  it 
is  useful  to  consider  p{Y\W),  obtained  by  marginalizing  the  voltages  as 

p(Y\W)  =  f  p(Y,V\W)  dV  .  (2.6) 


(2.3) 

(2.4) 

(2.5) 


3 


Since  p[Y,V\W)  is  Gaussian  in  (Y,V)  with  mean  linearly  dependent  on  W,  the  marginal  p(Y\W) 
has  the  same  property,  and  therefore  \ogp(Y\W)  is  quadratic  in  W, 

iogp(y|w)  =  E^’qE  WijMij,i'jlWi'j'  +  const.  (2.7) 

i,j 

where  i  =  1 . . .  N ,  j  =  1 . . .  K.  We  compute  the  coefficients  r,;,  and  in  Appendix  A.  In 

particular,  Mijyy  is  negative  semidefinite  and  symmetric. 

Our  goal  is  to  estimate  the  N  x  K  synaptic  weights  W  as 

W( A)  =  argmax  \ogp(W\Y,  A)  (2.8) 

w 

=  argmax  [logp(F|W)  +  \ogp{W\ A)]  ,  (2.9) 

w 

where  we  take  for  W  a  log-prior  with  an  li  penalty  that  enforces  a  sparse  solution, 

logp(IT|A)  =  —  A^  \  Wl'i\  +  const.  (2.10) 

The  prior  (2.10)  is  referred  to  as  the  lasso  prior  (for  ‘least  absolute  shrinkage  and  selection  operator’) 
in  the  statistics  literature;  its  effect  is  to  sparsen  the  estimated  W  in  (2.9),  i.e. ,  to  make  its  components 
exactly  0  if  they  do  not  have  a  strong  measurable  influence  on  the  observed  data  Y  (Tibshirani,  1996). 
In  our  case,  we  introduce  this  prior  because  we  expect  the  synaptic  contact  with  the  presynaptic 
neuron  to  occur  only  at  a  relatively  small  number  of  compartments  in  the  dendritic  tree. 

The  value  of  A  in  (2.10)  controls  the  sparsity  of  W{ A):  when  A  is  large,  the  number  of  non-zero 
components  in  IT  (A)  is  small,  and  vice  versa.  The  motivation  to  introduce  this  prior  here  is  that 
we  expect  the  number  of  non-zero  synaptic  weights  for  each  presynaptic  neuron  to  be  much  smaller 
than  the  number  of  compartments  N .  While  this  prior  turns  out  to  be  extremely  convenient  here 
(as  we  discuss  next),  more  involved  priors  are  possible;  see  section  4  for  some  further  discussion. 

Eq.(2.9)  is  a  concave  problem.  We  would  like  to  solve  it  for  a  range  of  values  of  A  and  then 
select  a  particular  A  according  to  some  criterion.  If  \ogp{Y\W)  were  the  quadratic  error  of  a  linear 
regression  with  Gaussian  noise,  the  solution  to  (2.9)  for  all  A  could  be  obtained  by  the  Least  Angle 
Regression  (LARS)  algorithm  introduced  in  Efron  et  al.  (2004). 

In  our  case  the  quadratic  expression  (2.7)  contains  contributions  from  the  integrated  unobserved 
voltages  V.  For  this  reason,  we  reformulate  in  Appendix  A.l  the  LARS  algorithm  of  Efron  et  al. 
(2004)  for  our  quadratic  function  of  W,  logp(Y|W).  Moreover,  synaptic  connections  are  either 
excitatory  or  inhibitory  (“Dale’s  law”),  so  the  non-zero  elements  of  each  of  the  K  columns  of  the 
true  synaptic  matrix  W  must  have  a  definite  sign.  We  consider  in  Section  A. 2  a  slight  modification 
of  the  LARS  algorithm,  LARS+,  that  imposes  this  sign  condition  and  avoids  inferring  weights  with 
the  wrong  sign  due  to  poor  observations  or  high  noise. 

As  we  show  in  Appendix  A,  the  ^-penalized  form  of  our  objective  function  in  (2.9)  implies  that 
the  solution  for  W( A)  is  a  piecewise  linear  function  of  A.  For  A  =  oo,  the  solution  to  (2.9)  is  W  =  0. 
As  A  becomes  smaller,  more  and  more  components  become  non-zero  at  the  breakpoints  of  IT(A), 
although  at  some  breakpoints  non-zero  components  can  also  become  zero.  The  form  of  the  solution 
is  thus 


IT(A) 


0  for  Ai  <  A 

W (A?;)  T  £q(Aj  —  A)  for  A^i  <C  A  <  A i  i  =  1 . . .  R , 


(2.11) 


where  a;  are  N  x  K  matrices  and  the  number  R  of  breakpoints  until  Afl+i  =  0  depends  on  the  data. 
The  LARS/LARS+  algorithm  proceeds  by  successively  computing  the  pairs  (A,,  a,)  until  A  =  0  is 
reached. 
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An  important  byproduct  of  the  algorithm  is  that  it  provides  us  with  an  estimate  of  the  unobserved 
voltages, 


V(X)  =  E[V\Y,W(\)\  (2.12) 

=  argmaxp(V|Y,  IU(A)) ,  (2.13) 

where  the  second  line  is  equal  to  the  first  because  p(V\Y,W  {X))  is  a  Gaussian  distribution.  The 
function  V (A)  will  be  important  below  to  select  the  best  value  for  A. 

The  value  of  W(X)  at  A  =  0,  the  end  point  of  the  path,  is  the  solution  to  the  unpenalized 
maximum  likelihood  problem 


W  =  argmax  \ogp(Y\W) ,  (2-14) 

w 

which  is  the  optimal  least-squares  (OLS)  linear  solution,  due  to  the  quadratic  nature  of  logp(Y|W). 
This  W  is  the  linear  combination  of  the  observations  yt  that  minimizes  the  log-likelihood  (2.7).  For 
LARS+,  the  end  point  of  the  path  is  the  minimum  of  (2.7),  with  the  restriction  that  each  of  the  K 
columns  of  the  inferred  synaptic  matrix  W  has  a  definite  sign. 


Computational  cost 

The  major  computational  challenge  in  obtaining  WlJ  (A)  lies  in  explicitly  computing  the  coeffi¬ 
cients  Tij  and  Mijjrji.  As  shown  in  Appendix  A,  computing  or  a  row  of  Mijyy  requires  that  we 
solve  a  linear  equation  involving  the  NT  x  NT  Hessian  matrix 


„  d2\ogp(Y,V\W) 

Hvv  = - Wav - 


(2.15) 


Using  the  block  tri-diagonal  structure  of  Hyy  (see  Appendix  C) ,  this  matrix  solve  can  be  performed 
in  0{TN3)  time  using  standard  methods.  However,  as  we  show  in  Appendix  C,  if  S  <C  N  (i.e. ,  only 
a  minority  of  compartments  are  imaged  directly,  as  is  typically  the  case  in  these  experiments)  we 
can  perform  this  operation  approximately  much  more  quickly,  in  0(TNS2)  instead  of  0(TN 3)  time, 
using  low-rank  perturbation  techniques  similar  to  those  introduced  in  Paninski  (2010);  Huggins  and 
Paninski  (2012);  Pnevmatikakis,  Paninski,  Rad  and  Huggins  (2012);  Pnevmatikakis  and  Paninski 
(2012). 

To  run  the  LARS/LARS+  algorithm  we  need  the  coefficients  rtJ- ,  and  at  each  breakpoint  in  which 
a  new  weight  W'1  becomes  non-zero,  the  ij- th  row  of  Mijyy  must  be  computed.  In  general  we  will 
need  to  compute  IUIJ(A)  up  to  k  <C  NK  breakpoints  (see  below),  leading  to  a  total  computational 
cost  from  acting  with  Hyy  of  0{kTN S2).  The  LARS/LARS+  algorithm  also  performs  some  smaller 
matrix  computations  at  each  breakpoint;  including  these,  the  total  computational  cost  is  0(kTNS2  + 
k3). 


Model  Selection 

The  next  task  is  to  select  the  point  along  the  LARS  path  Wl:l  ( A)  that  yields  the  “best”  inferred 
weights  Wv.  Two  major  methods  of  model  selection  for  our  case  include  cross-validation  and 
minimization  of  Cp-like  criteria  (see  e.g.  Efron  (2004)).  In  both  cases,  the  selected  model  is  not  that 
which  minimizes  the  squared  error  on  the  training  dataset.  This  solution  (corresponding  to  (2.14)) 
would  be  computationally  costly  and  typically  greatly  overfits  the  data,  unless  the  data  are  very 
informative  (i.e.,  high-SNR  and  T  N). 

We  will  consider  Mallows’  Cp  criterion  (Mallows,  1973)  in  the  low  signal-to-noise  ratio  (SNR) 
limit,  when  the  stochastic  term  in  (2.1)  can  be  ignored.  As  we  elaborate  in  Appendix  B,  in  this  limit 
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and  for  ST  >  NK ,  an  estimate  of  the  generalization  error  of  our  model  is  given  by 

T 

Cp(d)  =  II  Vt  ~  BtVt(\d)  ||2  +2 dCv  d  =  1, 2 . . .  NK,  (2.16) 

t= l 

where  Xd  is  the  smallest  value  of  A  at  which  there  are  d  non-zero  weights  in  the  path  (2.11).  The 
value  of  d  is  an  estimate  of  the  degrees  of  freedom  of  the  fitted  model  and  we  select  the  value  \d 
(or  equivalently  d)  that  minimizes  (2.16).  This  gives  the  best  compromise  between  the  fit  to  the 
data,  represented  by  the  first  term  (which  decreases  with  lower  A)  and  the  complexity  of  the  model, 
represented  by  the  factor  d  in  the  second  term  (which  increases  with  lower  A). 

As  we  discussed  above,  we  expect  the  number  of  non-zero  synaptic  weights  to  be  much  smaller 
than  NK.  Thus  one  can  stop  the  LARS  algorithm  if  after  k  <C  NK  steps  one  believes  that  the 
minimum  of  Cp(d)  was  attained  at  some  d  <  k.  This  is  often  possible  in  practice  (as  we  will  see 
below),  though  the  Cp  curve  is  typically  non-convex  in  d  or  A. 

The  cross-validation  approach  is  conceptually  somewhat  simpler.  In  2-fold  cross  validation,  we 
split  the  interval  T  into  2  segments  and  compute  the  IF,J'(d)  weights  using  data  from  one  of  the 
two  segments.  With  these  weights  we  evaluate  the  log-likelilrood  in  (2.7)  with  coefficients 
computed  from  the  left-out  segment.  We  repeat  the  procedure  (interchanging  the  training  and  test 
segments)  and  compute  an  average  likelihood  curve,  Q(d),  as  the  mean  of  the  two  test  log-likelihoods. 
We  select  the  model  d  at  the  minimum  of  this  curve.  We  finally  run  LARS/LARS+  on  the  whole 
interval  T,  and  estimate  the  value  Wl:>  for  d  active  weights. 

For  n-fold  cross  validation  with  n  >  2,  the  data  can  be  split  into  n  segments.  As  above  for  n  =  2, 
we  successively  leave  out  each  of  the  n  observation  subsets  and  use  the  remaining  observations  to 
compute  the  synaptic  weights  lF'tJ  (d).  In  this  case  the  held-out  test  set  will  lead  to  an  observed 
training  dataset  Y  with  a  gap  in  time  (where  the  test  set  was  removed).  The  likelihood  coefficients 
r,j ,  for  this  case  can  be  obtained  by  a  straightforward  application  of  the  method  developed 

in  Appendix  A,  but  for  simplicity  in  this  paper  we  only  consider  the  2-fold  case. 

Note  that  there  is  a  trade-off  between  these  two  methods.  The  Cp  criterion  is  computationally 
fast  because  we  only  have  to  run  the  LARS/LARS+  algorithm  once  in  order  to  compute  the  Cp(d) 
values  in  (2.16),  but  the  derivation  of  equation  (2.16)  assumes  low  SNR.  On  the  other  hand,  the 
cross  validation  method  is  computationally  more  intensive,  but  its  valid  for  any  SNR. 


A  fully  Bayesian  approach 

The  methods  presented  above  provide  us  with  point  estimates  of  the  synaptic  weights,  but  in  some 
situations  we  may  be  interested  in  confidence  intervals.  In  such  cases  we  can  adopt  a  fully  Bayesian 
approach,  and  consider  the  posterior  distribution  of  the  synaptic  weights  (not  just  the  maximum) 
under  a  sparsity-inducing  prior.  Among  several  possibilities  for  the  latter,  we  will  consider  here  the 
spike-and-slab  prior  (Mitchell  and  Beauchamp,  1988)  and  restrict  ourselves  to  one  presynaptic  signal 
(. K  =  1)  for  simplicity.  The  idea  is  to  augment  each  synaptic  weight  IF,;  with  a  binary  variable  s.; 
and  consider  the  prior  distribution 


N 


p(W,s\a,T 2)  =  JJp(IF,;|si,  T2)p(si\a) . 


where  each  pair  (IF,;,s,)  is  sampled  from 


spa  = 


IF,  I  Sj,  t2 


1  with  prob.  a , 

0  with  prob.  1  —  a  , 

5{Wi)  for  Si  =  0  , 

wf 

J_e~  2?a  for  s,'  =  1 . 


(2.17) 


(2.18) 

(2.19) 
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Note  that  the  sparsity  is  achieved  by  assigning  a  finite  probability  a  to  the  event  s,  =  W*  =  0.  On 
the  other  hand,  when  Sj  =  1,  IT)  is  sampled  from  the  distribution  in  (2.19),  which  is  a  Gaussian  with 
zero  mean  and  variance  r2.  In  general,  if  we  have  prior  information  about  the  synaptic  weights,  a 
and  r  could  depend  on  the  location  of  each  weight,  but  we  do  not  pursue  this  idea  here. 

For  the  hyperparameters  a  and  r2  we  will  use  conjugate  hyperpriors  (Gelrnan  et  ah,  2004):  a 
Beta  distribution  for  a  and  an  Inverse  Gamma  for  r2, 

p(a\aa,Pa)  °c  a“a(l  -  a)0a  ,  (2.20) 

p{T2\aT,/3T)  oc  T~2a--2e~^  .  (2.21) 

The  presence  of  two  hyperparameters  makes  the  spike-and-slab  similar  to  the  elastic  net  (Zou  and 
Hastie,  2005),  which  combines  both  lasso  and  ridge  penalties.  The  sparsity  parameter  a  corresponds 
roughly  to  the  lasso  A  parameter,  while  r~2  is  the  coefficient  of  the  ridge  penalty.  The  latter  is 
particularly  important  for  large  neurons  with  synaptic  weights  localized  in  several  nearby  locations: 
small  values  of  t  ,  which  can  be  favored  by  an  appropriate  hyperprior,  lead  to  similar  values  for 
these  correlated  weights  and  avoid  incorrectly  inferring  one  big  and  many  small  weights  (Zou  and 
Hastie,  2005). 

Given  the  observations  Y,  we  combine  the  prior  distribution  with  the  data  likelihood, 

p(Y\W)  oc  e^’iWiMiiW>+Y‘*riWi ,  (2.22) 

and  consider  the  joint  posterior  distribution 

p(W,  s,  a,  t2\Y)  oc  p(Y\W)p(W,s\a,T2)p(a\aa,l3a)p(T2\aT,/3T),  (2.23) 

which  we  can  sample  from  using  Markov  Chain  Monte  Carlo  (MCMC)  methods.  In  particular  we  use 

a  Gibbs  sampler  that  cyclically  samples  a,  r2  and  (W,  s ).  For  the  latter,  we  use  an  exact  Hamiltonian 
Monte  Carlo  sampler  based  on  the  method  of  (Pakrnan  and  Paninski,  2013).  The  sign  constraint 
from  Dale’s  law  can  be  imposed  by  simply  restricting  (2.23)  to  be  non-zero  only  for  Wi  >  0  or 
IT)  <  0.  Note  that  this  fully  Bayesian  approach  is  computationally  more  intensive  than  the  LARS 
method,  not  only  due  to  the  computational  cost  of  the  MCMC  sampling,  but  because  we  need  to 
pre-compute  the  full  M  matrix  in  (2.22). 

Note  that  in  (2.19)  we  assumed  a  (truncated)  Gaussian  prior  distribution  for  the  non-zero  synaptic 
weights.  This  choice  simplifies  the  sampling  from  the  posterior  (2.23),  but  it  is  known  that  heavy¬ 
tailed  distributions,  such  as  log-Normal,  are  a  more  realistic  choice  (Song  et  ah,  2005;  Barbour  et  ah, 
2007).  Although  we  will  not  consider  them  here,  such  priors  can  also  be  studied  using  appropriate 
MCMC  techniques,  see  (Smith,  2013)  for  details. 


Observation  Schemes 


An  observation  scheme  is  a  particular  selection  of  the  matrices  Bt  appearing  in  the  observation 
equation  (2.2).  The  simplest  such  matrix  would  be  the  identity,  i.e. ,  all  compartments  are  observed 
directly.  Since  it  is  currently  not  experimentally  feasible  to  observe  the  voltages  on  a  full  large 
dendritic  tree  in  three  dimensions  with  high  temporal  resolution,  we  will  consider  the  following  two 
cases  in  which  Bt  are  fat  rectangular  matrices,  i.e.,  the  number  of  observations  S  per  time  step  is 
less  than  the  number  of  compartments  N. 


•  Scan  observation. 

In  this  scheme,  the  S  x  N  observation  matrices  Bt  are 


f  1  for  j  =  p  *  i  +  t  mod  N  i  =  1. . .  S,  t=l...T,  p  €  N+ 

(  0  otherwise. 


(2.24) 


In  other  words,  we  observe  S  compartments  at  each  time,  at  changing  locations.  Each  row  has 
a  1  at  columns  separated  by  p  entries,  that  move  cyclically  at  each  observation.  Variations  of 
this  scheme  are  common  in  many  electro-physiological  experiments. 
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•  Compressed  sensing. 

This  relatively  new  paradigm  asserts  that  the  number  of  measurements  needed  to  reconstruct 
a  signal  is  much  smaller  for  sparse  than  for  non-sparse  signals.  For  a  review  and  references 
to  the  literature  see,  e.g.,  (Candes  and  Wakin,  2008).  To  see  how  this  applies  to  our  case, 
let  us  consider  the  case  with  no  noise  in  the  voltage  dynamics,  i.e.,  a  =  0.  As  shown  in 
eqs.(B.3)-(B.6),  in  this  limit  the  observations  yt  are  related  to  the  synaptic  weights  as 

Vt  =  DtW  +  Vt  (2.25) 

where 

Dt  =  BtFt  G  RSxN  (2.26) 

Ft  =  At-2U1+At~1U2  +  ---+Ut-i  (2.27) 

The  matrix  Dt  is  an  “effective”  observation  matrix.  Note  that  the  total  number  of  measure¬ 
ments  in  an  experiment  lasting  T  time  steps  is  ST,  and  suppose  that  W  has  K  non-zero  entries. 
If  the  entries  of  Dt  are  chosen  randomly  from  a  zero-mean  Gaussian  and  the  total  number  of 
measurements  obeys 

ST  >  ci K  log {N/K)  (2.28) 

for  some  constant  Ci ,  then,  with  overwhelming  probability,  the  quadratic  error  between  W  and 

its  lasso  estimate  W  is  bounded  as 

\\w  -W\\2  <c2Cy,  (2.29) 

for  some  constant  c2  (Candes  et  al.,  2006).  The  bound  (2.28)  stands  in  contrast  to  non-sparse 
signals,  which  require  O(N)  measurements  for  a  low-error  reconstruction.  The  experimental  re¬ 
alization  of  this  observation  scheme  is  presently  a  very  active  area  of  research,  see  e.g.  (Nikolenko 
et  al.,  2008;  Studer  et  al.,  2012). 

In  our  case,  we  can  implement  the  compressed  sensing  scheme  by  choosing,  at  each  t,  a  ma¬ 
trix  Dt  whose  entries  are  i.i.d.  samples  from  a  positive  Gaussian  distribution,  and  an  observa¬ 
tion  matrix 


Bt  =  DtFt~ 1 .  (2.30) 

A  potential  numerical  problem  arises  for  an  extended  set  of  times  without  external  stimulus  Ut. 
Suppose  Ut  =  0  for  t  =  t\  +  1, . . . ,  t2.  Then,  as  follows  from  (2.27), 

Ft2  =  At'2~tlFtl  ,  (2.31) 

and  since  the  matrix  A  is  stable  (i.e.,  its  eigenvalues  are  <  1),  the  matrix  Ft2  is  ill-conditioned, 
leading  to  a  numerical  instability  in  the  computation  of  (2.30).  So  this  observation  scheme 
is  better  applied  to  measurements  performed  at  those  times  t  in  which  a  stimulus  Ut  ^  0  is 
present. 


3  Results 

We  illustrate  our  methods  using  simulated  data  in  a  toy  model  and  in  a  real  neuronal  geometry.  In 
each  case,  we  started  with  a  known  matrix  A  (based  on  the  known  dendritic  geometery),  and  chose 
values  for  a2 ,dt,Cy,Ut  and  Bt.  We  sampled  values  for  the  dynamic  noise  et  and  obtained  values 
for  Vt  by  simulating  eq.(2.1)  forward  in  time.  We  next  sampled  values  for  the  observation  noise 
rjt  and  used  eq.(2.2)  to  obtain  yt.  In  all  the  cases  we  initialized  the  Vt  dynamics  so  that  Vt  was  a 
time-stationary  process,  ensuring  the  validity  of  the  approximations  discussed  in  Appendix  C.  All 
the  algorithms  were  implemented  in  MATLAB. 


Toy  neuron 

The  toy  model  neuron,  shown  in  Figure  2,  has  N  =  35  compartments  and  one  branching  point,  and 
we  assumed  three  positive  synaptic  weights,  indicated  by  circles  in  Figure  2.  Figures  3  to  9  show 
results  corresponding  to  one  presynaptic  input  (K  =  1),  with  a  stimulus  Ut  shown  in  the  upper  panel 
of  Figure  4.  For  the  scan  observation  scheme,  we  used  Bt  as  in  (2.24),  with  S  =  7  observations  at 
each  instant  and  column  spacing  p  =  5. 

Figure  3  illustrates  the  inferred  weights  as  a  function  of  A  for  both  the  LARS  and  LARS+ 
algorithms  with  scan  observation,  for  a  simulation  with  low  noise  covariance  Cy  =  0.05,  where 
in  a  slight  abuse  of  notation  we  abbreviate  Cy  =  [Oy]ii  (recall  that  the  observation  noise  Cy  is 
proportional  to  the  identity  here).  As  described  above,  the  weights  are  zero  for  A  =  oo  and  become 
active  at  breakpoints  as  A  becomes  smaller.  In  this  Figure  (as  in  Figures  5,  6,  8  and  9),  the  colors  of 
the  weights  correspond  to  their  location  in  Figure  2.  In  the  upper  panel,  corresponding  to  the  LARS 
algorithm,  all  the  weights  are  active  at  A  =  0.  On  the  other  hand,  for  LARS+  in  the  lower  panel, 
some  weights  never  become  active  because  that  would  violate  the  W  >  0  restriction.  The  vertical 
lines  show  the  weights  selected  by  the  Cp  and  2-fold  cross  validation  criteria. 

An  estimate  of  the  signal  power  for  V  can  be  obtained  as 

Ps  =  Mean,  (VartVJ(i))  ~  0.012,  (3.1) 

where  Vt(i)  is  the  voltage  at  compartment  i  and  time  t.  Using  this  value  we  can  estimate  the 
signal-to-noise  ratio  (SNR)  for  Figure  3  as 

SNR  =  Ps/Cy  ~  0.24  (3.2) 

Figure  4  shows,  along  with  the  presynaptic  signal,  the  true,  observed  and  estimated  voltages  for 
a  similar  simulation  with  scan  observations,  but  higher  noise  covariance  Cy  =  8  and  SNR  estimated 
as 


SNR  =  Ps/Cv  ~  0.0015 .  (3.3) 

Figure  4  also  shows  the  voltages  estimated  at  the  end  of  the  LARS  path,  the  OLS  point.  These 
estimates,  shown  in  the  fifth  panel,  are  of  poor  quality  compared  with  those  at  the  point  selected 
by  the  Cp  criterion,  shown  in  the  last  panel.  This  highlights  the  importance  of  the  l\  prior  in  the 
model.  The  higher  observation  noise  in  this  case  translates  into  a  solution  path  Wli  (A)  in  Figure  5 
in  which  the  weights  at  locations  with  zero  true  weights  grow  as  A  — >  0  to  values  comparable  to  the 
non-zero  true  weights  (i.e. ,  overfitting). 

The  weights  inferred  at  the  LARS+/CP  point  are  shown  in  more  detail  in  Figure  6.  An  interesting 
phenomenon  in  this  noisier  case  is  that  the  inferred  weights  are  locally  spread  around  their  true 
locations.  This  is  due  to  the  matrix  A  in  the  dynamical  equation  (2.1),  whose  inverse,  as  mentioned 
in  Section  2,  has  a  non-zero  off-diagonal  entry  A”1  if  compartments  ri\  and  712  are  first  neighbors. 

Figure  7  shows  Cp  and  cross-validation  statistics  for  these  data  as  a  function  of  the  degrees  of 
freedom  d,  for  LARS.  Note  that  the  minima  for  both  model  selection  curves  are  at  close-by  points. 

Figure  8  presents  a  comparison  of  scan  observations  and  compressed  sensing.  Across  20  simula¬ 
tions,  the  synaptic  weights  were  Up-estimated  as  a  function  of  the  experiment  time.  For  each  neuron 
compartments  the  median  and  .25/. 75  quantiles  of  the  estimated  weights  are  indicated.  In  each 
simulation,  the  observations  for  both  observation  schemes  were  made  on  the  same  data.  The  figure 
shows,  as  expected,  that  the  compressed  sensing  results  are  superior,  having  a  smaller  dispersion 
and  converging  faster  to  the  true  values. 

In  Figure  9,  we  examine  a  population  summary  of  the  results  of  100  simulations  with  the  same 
parameters  as  in  Figures  4  and  5.  As  expected,  the  LARS/LARS+  results  are  (downward)  biased 
and  have  low  variance,  and  the  OLS  results  are  unbiased  but  have  high  variance.  This  illustrates  the 
bias-variance  trade-off  involved  in  the  l\  regularization  (Gernan  et  ah,  1992).  Note  that  for  LARS+ 
the  values  above  the  median  are  less  dispersed  than  for  LARS. 
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Figure  2:  Toy  neuron  with  35  compartments  used  for  the  data  of  Figures  3-10.  The  three  compartments  with 
non-zero  synaptic  weights  for  the  simulations  with  one  presynaptic  signal  are  indicated  by  a  circle.  We  colored 
the  compartments  to  ease  the  visualization  of  the  corresponding  inferred  weights  in  Figures  3,  5,  6,  8  and  9. 


Figure  10  shows  the  stimuli  and  voltages  for  a  simulation  in  the  toy  neuron  with  Cy  =  8  and  two 
presynaptic  signals  ( K  =  2).  There  were  three  non-zero  weights  for  each  presynaptic  signal,  located 
at  different  compartments.  The  signal-to-noise  ratio  is  estimated  as 

SNR  =  Ps /Cv  ~  0.0016  (3.4) 

While  the  quality  of  the  voltages  estimated  in  Figure  10  is  good,  note  that  in  general  we  expect  the 
inferred  weights  and  voltages  to  lose  accuracy  as  K  grows,  since  we  have  to  estimate  more  weights, 
KN,  given  the  same  number  of  observations  yt  .  Thus  our  focus  here  is  on  modest  values  of  I\ . 

Real  neuron  geometry 

Figures  11  to  13  show  simulated  results  on  a  real  neuronal  geometry  with  N  =  2133  compartments, 
taken  from  the  THINSTAR  file1,  corresponding  to  a  rabbit  starburst  amacrine  cell  (Bloomfield  and 
Miller,  1986).  In  all  the  cases  we  considered  one  presynaptic  input  signal  (K  =  1)  and  estimated  the 
weights  with  the  Cp  criterion.  We  chose  28  compartments  with  non-zero  weights  (Fig.  12,  top  left 
panel).  In  all  the  figures  we  had 


SNR  =  Ps /Cv  ~  0.0034 ,  (3.5) 

and  the  observation  matrices  Bt  had  S  =  40  rows  and  N  =  2133  columns.  Inference  required  <  10 
minutes  to  process  700  timesteps  of  data,  with  k  =  140,  on  a  laptop  (Linux;  Core  2  Duo  processor, 
2.53GHz). 

Figure  11  shows  clearly  that  with  the  noisy  and  subsampled  voltages  of  the  third  panel  (‘Noisy 
Observations’),  obtained  with  scan  observations,  we  can  reconstruct  with  good  quality  the  full  spa- 
tiotemporal  voltages  in  the  last  panel. 

Figure  12  shows  the  true  and  inferred  synaptic  weights  in  the  THINSTAR  neuron  geometry 
for  20  simulations  of  scan  observations.  The  lower  left  panel,  showing  the  median  of  the  inferred 

1  Available  at  http://neuromorpho.org. 
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weights,  shows  that  our  method  is  able  to  infer  the  locations  of  almost  all  the  synaptic  locations, 
with  a  strength  slightly  biased  toward  lower  values.  To  measure  the  variability  of  the  results  across 
the  20  simulations,  we  computed  for  each  compartment,  the  quartiles  w. 75  and  w.25  of  the  .75 
and  .25  percentiles,  respectively,  of  the  weights  inferred  at  each  location  over  the  20  simulations. 
The  dispersion,  shown  in  the  lower  right  panel,  is  the  difference  A  =  re. 75  —  w.25,  computed  at  each 
compartment.  Comparing  the  dispersion  pattern  with  the  true  weights  shows  that  there  is  some 
variability  across  the  20  simulations  in  the  strength  of  the  inferred  weights,  but  minimal  variability 
in  the  location  of  the  inferred  synaptic  connections. 

Finally,  Figure  13  compares  the  median  of  the  inferred  weights  for  10  simulations,  in  scan  obser¬ 
vations  and  compressed  sensing,  for  both  a  short  experiment  of  T  =  200  ms.  and  a  long  experiment 
of  T  =  720  ms.  Again,  the  compressed  sensing  scheme  gives  better  results,  already  in  the  short 
experiment,  while  the  scan  observations  results,  while  not  optimal,  improve  from  the  short  to  the 
long  experiment. 

Some  additional  comments  are  in  order.  Firstly,  in  situations  with  high  noise,  some  small  non¬ 
zero  weights  tend  to  grow  along  the  LARS  path,  as  is  clear  in  Figure  5.  In  these  cases,  an  additional 
thresholding  of  the  final  inferred  weights  is  recommended.  Secondly,  the  geometry  of  the  neuron, 
encoded  in  the  matrix  A  in  the  dynamic  equation  (2.1),  is  not  always  known  with  full  precision. 
In  our  simulations,  we  have  noted  that  the  imposition  of  the  positivity  constraint  in  the  LARS+ 
algorithm  improves  significantly  the  robustness  of  the  inferred  weights  under  perturbations  of  the 
neuron  geometry.  Finally,  note  that  our  derivation  of  the  compressed  sensing  observation  matrices 
assumed  zero  noise  in  the  hidden  sector,  but  our  results  show  the  superiority  of  this  observation 
scheme  even  when  some  amount  of  noise  is  present. 

Bayesian  approach 

For  the  fully  Bayesian  approach  with  a  spike-and-slab  prior,  Figures  14  and  15  show  results  from 
samples  of  the  posterior  distribution  (2.23).  We  considered  an  experiment  in  the  THINSTAR  neuron 
with  scan  observations  and  T  =  400.  The  rest  of  the  parameters  were  similar  to  the  LARS/CP  results 
reported  above.  We  used  the  hyper-priors  (2.20)  and  (2.21),  with  parameters  aa  =  5,  0a  =  30,  aT  = 
20,  /3t  =  0.  Figures  14  and  15  show  results  from  1200  samples,  after  discarding  300  as  burn-in. 

A  point  estimate  for  the  synaptic  weights  can  be  obtained  from  the  mean  of  the  posterior  samples 
of  W, ,  which  allows  comparison  with  the  LARS/CP  results  for  the  same  dataset.  In  particular,  we 
found  that  the  mean  squared  error  (MSE)  (compared  with  the  true  weights)  was  8.48  for  the  Bayesian 
mean  and  9.89  for  the  LARS/CP  estimates. 
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Figure  3:  Toy  neuron  with  low  noise.  Active  weights  as  a  function  of  A  along  the  LARS  and  LARS+ 
paths  for  the  toy  neuron  model  of  Figure  2  with  SNR  ~  0.24,  7— dimensional  observations  and  one  pre-synaptic 
signal  (K  =  1).  The  simulation  was  run  for  T  =  500  ms;  The  A  axis  has  logarithmic  scale  and  the  three  true 
non-zero  weights  are  indicated  by  horizontal  dashed  lines.  The  colors  of  the  weights  correspond  to  their  location 
in  Figure  2.  The  models  selected  by  2-fold  cross  validation  are  indicated  by  a  vertical  dashed  line  and  by  the 
Cp  criterion  by  a  straight  line.  The  ticks  in  the  horizontal  axis  indicate  the  value  of  A  at  every  five  successive 
breakpoints.  The  high  quality  of  the  inferred  weights  is  due  to  the  relatively  high  SNR. 
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Figure  4:  Toy  neuron  with  high  noise.  Tracking  voltages  and  observations  for  the  toy  neuron  model  of 
Figure  2  with  SNR  ~  0.0015,  7— dimensional  observations  and  one  pre-synaptic  signal  (K  =  1).  The  simulation 
was  run  for  T  =  500  ms;  only  the  last  200  ms  are  displayed.  Top  panel:  presynaptic  signal  Utl  formed  by 
exponentially  filtering  the  (periodic)  presynaptic  spike  trains.  Second  panel:  true  voltages  evolving  according  to 
the  cable  equation  (2.1).  Third  panel:  noiseless  observations,  given  by  BtVt,  that  would  have  been  observed  had 
there  been  no  noise  term  rjt  in  the  observation  equation  (2.2).  Compartments  where  no  observations  are  taken  at 
a  given  time  t  are  left  at  zero.  Fourth  panel:  true,  noisy  observations  from  the  observation  equation  (2.2).  Fifth 
panel:  voltage  estimates  at  the  end  of  the  LARS  path,  the  OLS  point.  Bottom  panel:  inferred  voltages  V (A)  (see 
eq.  2.12)  estimated  using  the  sparse  LARS+  weights  selected  by  the  Cp  criterion.  The  poor  quality  of  the  OLS 
voltage  estimates  highlights  the  importance  of  the  l\  prior  in  the  model. 
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Figure  5:  Toy  neuron  with  high  noise.  Active  weights  as  a  function  of  A  along  the  LARS  and  LARS+ 
paths  for  the  data  shown  in  Figure  4  for  the  toy  model  neuron  of  Figure  2.  Conventions  as  in  Fig.  3.  Note  that 
inference  is  significantly  more  challenging  here.  The  LARS  solutions  select  weights  which  are  biased  downwards 
and  somewhat  locally  spread  around  the  corresponding  true  weights,  as  indicated  by  active  weights  with  similar 
colors.  Note  that  the  OLS  solution,  at  the  A  =  0  point  of  the  upper  panel,  performs  relatively  badly  here. 
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Figure  6:  Toy  neuron  with  high  noise.  True  and  Cp-selected  inferred  weights  for  the  data  described  in 
Figures  4  and  5.  Due  to  the  local  nature  of  the  dynamic  matrix  A,  there  are  non-zero  inferred  weights  in  the 
vicinity  of  the  original  weights.  Note  the  noisy  nature  of  the  OLS  results  compared  with  the  penalized  results. 
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Figure  7:  Cross-validation  and  Cp  curves.  Model  selection  curves  for  the  low  SNR  data  described  in 
Figures  4  and  5  using  LARS  inference.  The  upper  panel  shows  the  average  negative  log-likelihood  of  the  held- 
out  data  as  a  function  of  the  number  of  active  weights.  The  lower  panel  shows  the  values  of  the  Cp  criterion 
(eq.(2.16))  as  a  function  of  the  number  of  active  weights.  The  Cp  criterion  estimates  the  out-of-sample  error  and 
expresses  a  trade-off  between  a  better  fit  to  the  observed  data  and  the  simplicity  of  the  model.  The  minima  for 
both  model  selection  curves  are  indicated  by  circles. 
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Figure  8:  Compressed  Sensing  vs.  Scan  Observations.  LARS+/Cp-estimated  synaptic  weights  in 
20  simulations  of  the  toy  neuron  as  a  function  of  the  experiment  time.  For  each  of  the  35  compartments  the 
median  and  .25/. 75  quantiles  are  indicated.  The  average  SNR  was  2.18  and  the  dashed  lines  indicate  the  true 
weights.  In  each  simulation,  the  observations  for  both  observation  schemes  were  made  on  the  same  data  and  at 
the  same  times.  Note  that  the  compressed  sensing  estimates  reach  the  true  weight  values  in  shorter  experiments, 
as  expected. 
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Figure  9:  Toy  neuron  with  high  noise.  Distribution  of  LARS/Cp  and  LARS+/Cp  inferred  weights  for 
100  simulations  of  the  toy  neuron  described  in  4  and  5.  The  filled  rectangles  extend  from  the  25th  to  the  75th 
percentile  and  the  horizontal  lines  denote  the  median.  Both  the  LARS  and  the  LARS+  results  are  downward 
biased  and  have  low  variance,  and  the  OLS  results  are  unbiased  but  have  high  variance.  Note  that  for  LARS+ 
the  values  above  the  median  are  slightly  less  dispersed  than  for  LARS. 


18 


Presynaptic  signals  U( 


10 

Cl 

I  20 

30 


True  V 


T - 1 - 1 - 1 - T 


3 

2 

1 


Noiseless  Observations 


3 

2 

1 


Noisy  Observations 


Estimated  V  for  OLS  W 


10 

20 

30 


Estimated  V  for  LARS+/Cp 


20 


40 


60 


80 


100  120 
time  (ms) 


140 


160 


180 


200 


Figure  10:  Toy  neuron  with  high  noise  and  two  presynaptic  inputs.  Tracking  voltages  and  observations 
for  the  toy  neuron  of  Figure  2  with  Cy  =  8,  7— dimensional  observations,  two  pre-synaptic  signals  ( K  =  2) 
and  SNR  ~  0.0016.  Conventions  and  length  of  the  experiment  are  as  in  Figure  4,  except  top  panel  shows  two 
presynaptic  input  signals  (one  red  and  one  blue).  In  general  we  expect  the  quality  of  the  inferred  weights  and 
voltages  to  be  worse  as  K  grows. 
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Figure  11:  Big  neuron  with  scan  observations.  Tracking  voltages  and  observations  for  the  THINSTAR 
neuron  with  2133  compartments.  Note  that  for  space  reasons  the  axes  of  the  voltage  panels  are  inverted  compared 
to  the  previous  figures.  The  simulation  was  run  for  T=700  ms  and  the  presynaptic  neuron  spiked  every  6  ms. 
Since  the  full  N  x  T  voltage  matrix  is  too  large  to  examine  directly,  we  only  show  250  compartments  for  the  last 
60ms.  At  each  time  point  40  voltage  observations  were  made. 
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Figure  12:  Big  neuron  with  scan  observations.  True  and  inferred  synaptic  weights  in  the  THINSTAR 
neuron  for  20  simulations,  with  the  parameters  indicated  in  Figure  11.  Note  that  the  proposed  method  is  able  to 
infer  the  locations  of  almost  all  the  synaptic  locations,  again  with  a  slight  downward  bias.  Upper  right:  results  of 
a  single  experiment.  Lower  left:  median  results  over  all  20  experiments.  Lower-right:  dispersion  of  results  across 
all  20  experiments  (see  main  text  for  details).  Comparing  the  dispersion  pattern  with  the  true  weights  shows 
that  there  is  some  variability  across  the  20  simulations  in  terms  of  the  strength  of  the  inferred  weights,  but  the 
variability  in  terms  of  the  location  of  the  inferred  synapses  is  small. 
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Figure  13:  Comparison  of  observation  schemes.  2lrue  and  median  of  Gj, -selected  inferred  synaptic  weights 
for  10  simulations  in  the  THINSTAR  neuron.  The  scan  observation  results  improve  from  the  short  T  =  200  to  the 
long  T  =  720  experiment.  The  compressed  sensing  scheme  gives  optimal  results  already  in  the  short  experiment. 
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Figure  14:  Bayesian  Inference  with  spike-and-slab  prior.  Inferred  weights  for  a  synthetic  experiment 
in  the  THINSTAR  neuron  with  scan  observations  and  T  =  400.  The  Bayesian  approach  with  a  spike-and-slab 
prior  quantifies  the  uncertainty  of  the  inferred  weights,  both  through  the  posterior  variance  of  the  weights  (lower 
left  panel)  and  through  the  posterior  inclusion  probability  (lower  right  panel).  The  latter  is  defined  as  p(si\Y) 
for  each  weight  W),  with  s,  the  binary  variable  from  the  spike-and-slab  prior  (see  (2.18)-(2.19)).  The  results 
correspond  to  1200  samples,  after  discarding  300  as  burn-in.  In  the  three  panels  with  Bayesian  results,  we  set 
to  zero  all  weights  with  a  posterior  inclusion  probability  lower  than  0.1.  The  color  scale  of  all  the  panels  is  the 
same,  except  for  the  lower  right  panel. 
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Figure  15:  Samples  from  the  Spike-and-Slab  posterior  distribution.  Samples  from  the  posterior  distri¬ 
bution  (2.23)  for  the  data  shown  in  Figure  14.  Show  are  1200  samples,  after  discarding  300  as  burn-in.  Upper 
panel:  posterior  samples  from  the  sparsity  parameter  a.  Middle  panel:  posterior  samples  from  the  slab  parame¬ 
ter  t2.  Lower  panel:  posterior  samples  from  one  of  the  N  =  2133  synaptic  weights.  Note  that  many  samples  in 
the  lower  panel  are  zero,  since  the  posterior  inclusion  probability  for  this  weight  was  p(s|Y)  ~  0.48. 


4  Conclusion  and  Extensions 

Our  simulations  on  both  toy  and  real  neuronal  geometries  demonstrate  the  potential  utility  of  our 
techniques  for  inferring  the  location  and  strength  of  synaptic  connections  when  using  both  scan  and 
compressed  sensing  observations.  Numerical  simulations  indicate  that  LARS+  performs  better  than 
LARS  or  OLS  and  is  able  to  learn  the  synaptic  weights  even  under  low  SNR  conditions.  We  close 
by  noting  that  the  basic  model  we  have  considered  here  can  accommodate  several  possible  extensions. 

Robust  observation  model.  We  have  considered  so  far  only  Gaussian  noise,  for  both  the 
dynamics  and  the  observations.  However,  it  is  known  that  estimates  based  on  a  quadratic  objective 
function  can  be  very  non-robust  (i.e.,  sensitive  to  outliers).  The  standard  generalization  is  to  use 
a  log-likelihood  that  does  not  grow  quadratically,  but  instead  flattens  out  towards  linear  growth  as 
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the  errors  become  large,  as  in  the  Huber  loss  function  (Huber,  1964) 


f  x2 /2  for|a;|<i, 

\  t\x\  —  t2/2  for|a;|>t. 


(4.1) 


Note  that  this  loss  function  is  convex;  therefore,  log-likelihoods  chosen  to  be  proportional  to  the 
negative  of  this  loss  function  will  be  concave.  More  generally,  if  the  observation  log-likelihood  is 
concave,  then  the  inference  method  we  have  introduced  remains  tractable.  To  compute  the  optimal 
voltage  path, 

V(W)  =  argmaxlogp(VjY’,  W) ,  (4.2) 


we  can  use  Newton’s  method,  where  each  step  requires  one  call  to  the  Low-Rank  Block-Thomas 
algorithm  discussed  in  Pnevmatikakis  and  Paninski  (2012),  which  generalizes  the  method  outlined 
in  Appendix  C.  To  estimate  the  weights,  we  can  use  a  Laplace  approximation  to  logp(Y\W),  so  for 
each  A  we  want  to  solve 


W  =  argrnax  \ogp{Y\W)  +  \ogp(W\\)  (4.3) 

w 

~  argrnax  \ogp(V{W)\W)  +  \ogp{Y\V(W))  —  ^  log \—HVv\  +  logp(W|A) ,  (4.4) 

w  2 

where  the  Hessian  Hyy.  defined  in  (A. 6),  is  evaluated  at  V .  In  the  Gaussian  case,  the  first  two 
terms  are  quadratic  in  W,  and  Hyy  is  constant;  see  Huggins  and  Paninski  (2012)  for  further  details 
on  the  evaluation  of  the  log  |  —  Hyv  I  term.  For  most  reasonable  concave  observation  log-likelihoods, 
the  solution  IT(A)  is  continuous  in  A,  and  therefore  we  use  the  same  path- following  idea  exploited 
by  LARS  to  efficiently  compute  the  solution  path  W( A).  The  coordinate- wise  descent  algorithm 
of  Friedman  et  al.  (2007)  provides  one  efficient  solution  for  finding  an  optimal  IT  at  a  given  A  value, 
given  a  previous  solution  at  a  nearby  value  of  A. 


Slow  synapses.  A  slow  synapse  corresponds  in  our  model  to  the  filtered  arrival  of  the  presynaptic 
signal  Ut  at  several  delayed  times.  We  can  incorporate  such  a  scenario  by  modifying  the  dynamic 
equation  (2.1)  as 


D 

Vt+dt  =  AVt  +  WpUt-p  +  e«,  et  ~  Af{0,  a2dtl)  (4-5) 

p— 0 

where  each  Wp  is  a  N  x  K  synaptic  weights  matrix  for  the  stimuli  arriving  with  a  delay  of  p  time 
units.  (Equivalently,  it  is  possible  to  expand  Ut  in  a  different  basis  set  of  filter  functions.)  For  this 
case,  it  is  natural  to  modify  the  prior  p(W\X)  to  require  that  all  D  weights  at  a  given  compartment 
be  zero  or  non-zero  jointly.  This  can  be  enforced  with  the  grouped  lasso  prior  (Lin  and  Zhang,  2006; 
Yuan  and  Lin,  2006), 


log  p(W\X)  =  -Xj2\\WlJh 

i,3 


(4.6) 


with  ||IT*’-7||2  =  ^Ep=0(Wp’3)2,  which  is  known  to  encourage  solutions  for  which  the  “group”  of 

elements  are  held  at  zero  (for  a  given  i,j).  Again,  the  coordinate- wise  descent  algorithm 

of  Friedman  et  al.  (2007)  is  applicable  here. 

More  generally,  it  is  worth  noting  that  the  l\  (or  group-li)  prior  we  have  used  here  is  rather  sim¬ 
ple,  and  could  be  generalized  considerably.  In  many  cases  we  may  have  additional  prior  information 
that  can  be  exploited  statistically:  for  example,  we  may  know  from  previous  anatomical  studies  that 


25 


a  given  presynaptic  cell  type  might  prefer  to  synapse  on  the  postsynaptic  neuron  at  perisonratic  but 
not  distal  dendritic  locations.  This  can  easily  be  incorporated  here  by  varying  the  weight  of  the  l\ 
penalty  in  a  compartment-  and  cell-type-dependent  manner.  See  Mishchenko  and  Paninski  (2012) 
for  further  discussion. 

Other  observation  models.  We  can  also  incorporate  more  general  observation  models;  for 
example,  some  voltage  indicators  have  their  own  intrinsic  dynamics  (this  is  particularly  relevant  in 
the  case  of  genetically-encoded  indicators  (Knopfel  et  al.,  2006)),  which  can  be  incorporated  into 
either  the  dynamics  model  (as  an  additional  dynamical  variable)  or  the  observation  model  p(Y |V); 
see  Paninski  (2010)  for  details.  Another  important  direction  for  future  work  is  to  incorporate  cal¬ 
cium  measurements,  which  provide  higher-SNR  information  about  a  nonlinearly-thresholded  version 
of  the  voltage  signal  (Gobel  and  Helmchen,  2007;  Larkurn  et  ah,  2008;  Takahashi  et  al.,  2012;  Pnev- 
matikakis,  Kelleher,  Chen,  Saggau,  Josic  and  Paninski,  2012). 

Non-linear  effects.  It  will  also  be  important  to  generalize  our  methods  to  increase  their  robust¬ 
ness  to  nonlinearities  and  other  departures  from  the  basic  model  (2.1)-(2.2).  In  particular,  shunting 
effects  may  play  a  role  for  large  synchronous  inputs  to  nearby  compartments,  so  it  would  be  desirable 
to  have  conductance  terms  attached  to  each  compartment  (Paninski  et  ah,  2012).  Other  important 
effects  include  synaptic  depression  and  probabilistic  release;  these  random,  spike-history  dependent 
terms  can  be  handled  in  principle  using  Expectation-Maximization  methods  (Huys  and  Paninski, 
2009),  but  further  work  will  be  necessary  to  ascertain  the  effectiveness  of  these  methods  in  the  con¬ 
text  of  the  type  of  experimental  data  considered  here. 
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A  The  quadratic  function  and  the  LARS+  algorithm 


In  this  appendix  we  provide  the  details  of  the  algorithm  used  to  obtain  the  solution  W'HX)  for  all 
A,  where  i  =  1 . . .  N  indicates  the  neuron  compartment  and  j  =  1 ...  K  the  presynaptic  stimulus 
associated  with  the  weight.  To  simplify  the  notation,  define 


Q(W)  =  \ogp(Y\W) , 

Q(W,V)  =  \ogp(Y,V\W), 

where  these  expressions  are  related  by 

p{Y\W)  =  J p(Y,V\W)  dV. 

Let  us  first  obtain  an  explicit  expression  for  Q(W).  Recall  from  Section  2  that 

1  T 

Q(W,V)  =  -  YJ{yt~Btvt)Tc-\yt-Btvt) 


(A.l) 

(A.2) 


(A.3) 


(A.4) 


t=  1 
T 


Av*-'  -  WUt^fCyHVt  -  AVt_!  -  WUt _!) 


t=2 


—  -Vi  C0  Vi  +  const. . 

Since  Q(W,V)  is  quadratic  and  concave  in  V,  we  can  expand  it  around  its  maximum  V(W)  as 


Q(W,V)  =  Q(W,V)  +  -(V-V(W))tHvv(V-V(W)) 


where  the  NT  x  NT  Hessian 


Hvv  = 


d2Q{W. ,  V ) 


(A.5) 


(A.6) 


dVdV  ’ 

does  not  depend  on  Y  or  W,  as  is  clear  from  (A.4).  Inserting  the  expansion  (A.5)  in  the  integral  (A.3) 
and  taking  the  log,  we  get 


Q(W)  =  Q{W,V(W))+c 

=  \ogp(Y\V(W))  +  log  p(V(W)\W)  +  c 


(A.7) 

(A.8) 


where  c  =  —  4  log  \  —HVv\  +  log27r  is  independent  of  W. 

Since  V(W)  is  the  maximum  of  Q(W,  V),  its  value  is  the  solution  of  VyQ( IT,  V )  =  0,  given  by 


where 


Z(W)  =  NvQ(W,V)\v=o  = 


V(W)  =  —HyyZ(W) 


(  Bf  C~1yi  -  ATCy1WU1  \ 
BZC-'yi  -  ATCy1WU2  +  CyXWUx 
BlC~xyz  -  ATCy1WU3  +  Cy'WU? 


(A.9) 


V 


t,NT 


(A.10) 


J 


as  follows  from  (A.4).  It  is  useful  to  expand  Z(W)  as 

Z(W)  =  Z0  +  J2  ZiJWij 


(A.ll) 
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where  the  coefficients  Zq,  Zij  G  B.NT  can  be  read  out  from  (A.  10)  and  are  independent  of  W.  This 
in  turn  gives  an  expansion  for  V  in  (A. 9)  as 


V(W)  =  Fo  +  X  VijWiJ  G  Rnt 
id 

(A.12) 

where 

Vo  =  -Hyl^Zo  G  Rnt 

(A.13) 

%  =  -HyyZij  G  Rnt 

(A.14) 

are  independent  of  W.  Note  that  Vo  has  components 

/(Vo),\ 

j  j 

(A.15) 

\(Vo )t/ 

where  each  (Vo )t  is  an  N-vector,  and  similarly  for  each  Vij. 

To  obtain  the  explicit  form  of  Q(W)  one  can  insert  the  expansion  (A. 12)  for  V(W)  in  (A. 8).  But 
it  is  easier  to  notice  first,  using  the  chain  rule,  that 


dQ(W,V(W)) 

dW 


dQ(W,  V(W))  dQ{W ,  V(W))  dV{W) 
dW  +  W  dW 
8Q(W,V(W)) 
dW 

T 

CV 1  X^  -  AVt- 1  -  WUt^Ul, 

t—2 


(A. 16) 
(A.17) 
(A.18) 


where  the  second  term  in  (A.  16)  is  zero  since  V(W)  is  the  maximum  for  any  W.  Thus  once  V  is 
available,  the  gradient  of  Q  w.r.t.  W  is  easy  to  compute,  since  multiplication  by  the  sparse  cable 
dynamics  matrix  A  is  fast.  We  can  now  insert  (A. 12)  into  the  much  simpler  expression  (A.18)  to  get 


dQ{W,V{W)) 

dWv 


Mwj'W* 


with  i,i'  =  1 ...  N  and  j,  j'  =  1 . . .  K  and  coefficients 


(A.19) 


Mf. 


t—2 
1  T 

2^7/  —  i)  {Ut-i)j  —  (Ut-i)j(Ut~i)j'5n' 

1  t= 2  * 


(A. 20) 
(A.21) 


where  6u>  is  Kronecker’s  delta.  The  desired  expression  for  Q(W)  follows  by  a  simple  integration 
of  (A.19)  and  gives  the  quadratic  expression 

Q(W)  =  X!  'A11"  +  +  const.  (A.22) 

i,j 

where  i  =  1 . . .  N  ,  j  =  1 . . .  K.  Note  that  the  costly  step,  computationally,  is  the  linear  matrix  solve 
involving  Hyy  in  (A.13)-(A.14)  to  obtain  the  components  of  V,  which  are  then  used  in  (A.20)-(A.21) 
to  obtain  and  in  0(T)  time.  Note  that  we  do  not  need  the  explicit  form  of  Hyy,  only  its 

action  on  the  vectors  Zq,  Z^. 
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Matrix  form  of  coefficients 


For  just  one  presynaptic  signal  (K  =  1),  we  can  express  the  coefficients  of  the  log-likelihood  (A. 22) 
in  a  compact  form  by  defining  the  matrices 


P 


U 

B 


(—A  In 

—A  IN 


\ 


G  RNTxNT 


V 

(UJn  • 

(B1 

B-2 


- A  In 

0/ 

Ut-iIn  0)  G  RNxNT 

\ 


G  RSTxNT 


V 


(A. 23) 


(A. 24) 

(A. 25) 


and  =  Cy  1Ist ,  where  In  and  1st  are  identity  matrices  of  the  indicated  dimensions.  Using 
these  matrices,  the  expansion  (A.  12)  for  the  estimated  voltages  is 


V(W)  =  U0  +  VW 

with 

U0  =  —HyyBT C~^Y  G  Rnt 
V  =  (Ui  •••Uv) 

=  - Hy\rPTUTCy 1  G  RNTxN 

where  Y  in  (A. 27)  is 


(A. 26) 

(A. 27) 
(A. 28) 
(A. 29) 


(A. 30) 


The  coefficients  of  the  quadratic  log-likelihood  in  (A. 22)  can  now  be  expressed  as 


r  =  Cy'UPVo 

=  -~Cy1UPHylrBTC-'Y  G  Rn 

and 


(A.31) 
(A. 32) 


M  =  C^UPV-WUlfCy1  (A. 33) 

=  ~Cy1UPHy^PTUTCy1  -WUfCy1  G  (A.34) 

where  we  defined  ||U||2  =  Ym=i  U t  ■  Note  that  this  form  makes  evident  that  M  is  symmetric  and 
negative  semidefinite,  which  is  not  obvious  in  (A. 21).  In  matrix  form,  the  OLS  solution  is  given  by 


W 


argmax  W  r  H — W  MW 
w  2 

— M_1r 

-i? 


-  ( CylUUT  +  Cy1UPHy1yPTUTCy1)  Cy1UPHy1yBTC~^Y 

UP 


PTUTCy1UP 

- h  nvv 


btc^y, 


(A. 35) 
(A. 36) 
(A. 37) 

(A. 38) 
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(A. 39) 


where  in  the  last  line  we  used  the  identity 

(a-1  +  btc-1b)-1btc~1  =  ABt(BABt  +  C)"1 . 

A.l  LARS-lasso 

We  will  restate  here  the  LARS-lasso  algorithm  from  Efron  et  al.  (2004)  for  a  generic  concave  quadratic 
function  Q{W).  We  are  interested  in  solving2 

W( A)  =  arg  max  L( W,  A)  (A.40) 

w 

where 

N 

L(W,X)  =  Q(W)~\J2\wi\-  (A.41) 

i= 1 

As  we  saw  in  eq.(2.11),  the  solution  for  W  is  a  piecewise  linear  function  of  A,  with  components 
becoming  zero  or  non-zero  at  the  breakpoints. 

As  a  function  of  Wl,  L(W,  A)  is  differentiable  everywhere  except  at  Wl  =  0.  Therefore,  if  Wl  is 
non-zero  at  the  maximum  of  L(W,  A),  it  follows  that 


dL(W,  A) 
dWi 

for  r/o, 

(A. 42) 

or  equivalently 

ViQ(W)  =  ri  +  Miti>Wi'  = 

A  sign(Wl)  for  Wl  7^  0  , 

(A. 43) 

which  implies 

\ViQ(W)\  = 

A  for  r/0. 

(A. 44) 

For  A  =  oo,  one  can  ignore  the  first  term  in  (A. 41),  so  the  solution  to  (A.40)  is  clearly  W1  =  0.  One 
can  show  that  this  holds  for  all  A  >  Ai,  where 


Ai  =  max \ViQ\w=o  =  max  Kl  (A. 45) 

i  i 

Suppose,  without  loss  of  generality,  that  the  maximum  in  (A. 45)  occurs  for  i  =  1.  The  condition 
(A. 43)  will  now  be  satisfied  for  non-zero  W1,  so  we  decrease  A  and  let  W1  change  as 

A  =  Ai  —  7  (A. 46) 

W1^)  =  7 a1  7  e  [0,  Ai]  (A. 47) 

while  the  other  W’ s  are  kept  to  zero.  To  find  a1,  insert  (A. 47)  in  (A. 43), 

r\  +  M1170 1  =  (Ai  —  7)  sign(a 1) ,  (A. 48) 

from  which  we  get  ai  =  — ri/(AiMu).  Proceeding  in  this  way,  and  denoting  by  Wp(y)  the  vector  of 
weights  after  the  p-th  breakpoint,  in  general  we  will  have,  after  p  steps 

A  =  Xp  -  7  (A. 49) 

Wp(7)  =  linear  in  7  with  k  <  p  non-zero  components,  (A. 50) 

|ViQ(Wp(7/))|  =  Xp  —  7  i  =  l...k  non-zero  directions,  (A. 51) 

|Vi'Q(Wp(7/))|  <  Xp  —  7  i!  >  k  zero  directions,  (A. 52) 

and  we  let  7  grow  until  either  of  these  conditions  occurs: 

2We  omit  from  here  on  the  indices  j,j'  in  W:‘  and  :ji  to  simplify  the  notation. 
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1. 


At  7  =  7'  the  gradient  along  a  zero  direction,  say  Wk+1,  satisfies 


|Vfc+1Q(Wp(7'))|=Ap-7'- 


(A. 53) 


If  this  happens  we  let  Wk+1  become  active.  Define 

wp  =  wp(Y), 

Ap+i  =  Ap  —  7  , 

and  continue  with  k  +  1  components  as: 


Wp+i(7)  =  Wp  +  7a  = 


twl\ 

(  a1  \ 

+  7 

wk 

p 

ak 

V  0  j 

\ak+l  j 

7  £  [0,  Ap+i] 


A  —  Ap+i  —  7 

To  find  the  new  velocity  a,  insert  Wp+i(7)  into  (A. 43)  to  get 

(  Mn  •■•Af1(fc+1)  \  /  a1  \  /  \ 


a 

\ak+1J 


sign(Wk] 
\sign(ak+1) ) 


(A. 54) 
(A. 55) 


(A. 56) 
(A.57) 

(A. 58) 


\M(fc+i)i . . .  Af(fc+i)(fc+i)/ 

In  this  equation  we  need  sign(ak+1) ,  which,  as  we  show  in  Section  A. 3,  coincides  with  that  of 
the  derivative  computed  in  (A. 53), 

sign(ak+1)  =  sign{V k+1Q(Wp('y'))) .  (A.59) 


2. 


A  component  of  Wp(7),  say  Wk ,  becomes  zero  at  7  =  7'. 


(A. 60) 


If  this  happens,  Wk  must  drop  from  the  active  set  because  the  path  of  Wp(7)  was  obtained 
assuming  a  definite  sign  for  Wk  in  (A. 43).  So  we  define 


wp  =  wp(Y), 

Ap+i  =  Ap  7  > 

drop  Wk  from  the  active  set  and  continue  with  k  —  1  active  components  as: 


/  K1 


Wp+i(7)  =  Wp  +  7a  = 


7  G  [0,  Ap+i] 


Wi 


k- 1 
P 


~,k— 1 


A  =  Ap+i  -  7 

To  find  the  new  a,  inserting  Wp+i(7)  into  (A. 43)  gives 
Mu  •  •  •  Mi(k-i)  \  /  a1 


sign(W±) 


\M(k-i)l  ■  ■  ■  M(fc_i)(fc_i)  /  \a 

from  which  a  can  be  solved. 


k- 1 


Ksign(Wk  x); 


(A.61) 
(A. 62) 

(A. 63) 
(A. 64) 

(A. 65) 
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As  each  a  is  found,  we  decrease  A  by  increasing  7,  and  check  again  for  either  cases  1  or  2  until  we 
reach  A  =  0,  at  which  point  all  directions  will  be  active  and  the  weights  will  correspond  to  the  global 
maximum  of  Q(W). 

Having  presented  the  algorithm,  let  us  discuss  its  computational  cost.  To  obtain  pi  we  need 
to  act  with  Hyy  on  Z0  (see  (A.  13)  and  (A.20)).  Similarly,  for  each  new  active  weight  Wk+1  the 
(k  +  l)-th  column  of  M  is  needed  in  (A. 58),  which  comes  from  acting  with  Hyy  on  Zk+i  (see  (A. 14) 
and  (A. 21)).  The  action  of  Hyy  has  a  runtime  of  0(TN3),  but  in  Appendix  C  we  show  how  to 
reduce  it  to  0(TNS2)  with  a  low- rank  approximation.  For  the  total  computational  cost,  we  have 
to  add  the  runtime  of  solving  (A. 58).  Since  at  each  breakpoint  the  matrix  in  the  left-hand  side 
of  (A. 58)  only  changes  by  the  addition  of  the  ( k  +  l)th  row  and  column,  the  solution  takes  0(k2) 
instead  of  0(k3)  (Efron  et  al.,  2004).  Running  the  LARS  algorithm  through  k  steps,  the  total  cost 
is  then  0(kTNS 2  +  k3)  time. 


A. 2  Enforcing  a  sign  for  the  inferred  weights 

We  can  enforce  a  definite  sign  for  the  non-zero  weights  by  a  simple  modification  of  the  LARS-lasso. 
Assuming  for  concreteness  an  excitatory  synapse,  the  solution  to  (A. 40)  for  all  A  and  subject  to 

Wi  >  0 


can  be  obtained  by  allowing  a  weight  to  become  active  only  if  its  value  along  the  new  direction  is 
positive.  The  enforcement  of  this  condition  for  the  linear  regression  case  was  discussed  in  Efron  et  al. 
(2004).  In  our  formulation  of  the  LARS-lasso  algorithm,  the  positivity  can  be  enforced  by  requiring 
that  the  first  weight  becomes  active  when 


Ai  =  maxrj  r*  >  0 
i 


(A.66) 


and  by  replacing  the  condition  that  triggers  the  introduction  of  new  active  weights,  denoted  above 
as  condition  1,  by 


1. 


At  7  =  7'  the  gradient  along  a  zero  direction,  say  Wk+1,  satisfies 

Vfc+1Q(WP(7'))  =  +(AP  -  7') .  i  G  [0,  A„] 


(A. 67) 


By  requiring  the  derivative  along  Wk+1  to  be  positive  at  the  moment  of  joining  the  active  set,  we 
guarantee  that  Wk+1  will  be  positive  due  to  the  result  of  Section  A. 3. 

When  A  reaches  zero,  the  weights,  some  of  which  may  be  zero,  are  the  solution  to  the  quadratic 
program 


W  =  argmax  Q(W) ,  W1  >  0 .  (A. 68) 

w 

We  will  refer  to  the  LARS-lasso  algorithm  with  the  modification  (A. 67)  as  LARS+.  In  practice,  the 
measurements  can  be  so  noisy  that  the  algorithm  may  have  to  be  run  assuming  both  non-negative  and 
non-positive  weights,  and  the  nature  of  the  synapse  can  be  established  by  comparing  the  likelihood 
of  both  results  at  their  respective  maxima.  More  generally,  if  K  >  1  we  have  to  estimate  the  sign  of 
each  presynaptic  neuron;  this  can  be  done  by  computing  the  likelihoods  for  each  of  the  2K  possible 
sign  configurations.  This  exhaustive  approach  is  tractable  since  we  are  focusing  here  on  the  small- AT 
setting;  for  larger  values  of  A',  approximate  greedy  approaches  may  be  necessary  Mishchenko  et  al. 
(2011). 
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A. 3  The  sign  of  a  new  active  variable 

Property:  the  sign  of  a  new  variable  Wk+1  which  joins  the  active  group  is  the  sign  of  V k+tQ{W) 
at  the  moment  of  joining. 


Proof:3  Remember  that  the  matrix  Mlt/  is  negative  definite  and,  in  particular,  its  diagonal  elements 
are  negative 

Mu  <  0 ,  i  =  l...N  (A. 69) 

As  we  saw  in  Section  A.l,  if  the  first  variable  to  become  active  is 

Wi(7)  =  7a1  7  €  [0,  Ai]  (A. 70) 

with 


Ai  =  max  |ViQ|w=o  =  M  , 
i 


we  have 

and  using  (A. 69)  and  Ai  >  0  we  get 


n 

A 1  Mu 


sgn{ai)  =  sgn(ri) 

as  claimed.  Suppose  now  that  there  are  k  active  coordinates  and  our  solution  is 


Wp(7) 


/W^)\ 

\Wkh)J 


7  €  [0,  Ap] 


Define 

cj{l)  =  VjQ(Wp(7)) , 

and  note  that 

lcf  (7)1  =  Ap  —  7  j  =  l...k. 

Suppose  a  new  variable  Wk+1  enters  the  active  set  at  7  =  7'  such  that 


(A.71) 
(A. 72) 
(A. 73) 

(A. 74) 

(A.75) 
(A. 76) 


|cfc+1(7')l  =Ap  — 7'  (A. 77) 

It  is  easy  to  see  that  when  taking  7  all  the  way  to  Ap,  the  sign  of  Cfc+i(7)  does  not  change 

sgn(ck+ 1(7'))  =  sgn(ck+i(Xp))  (A. 78) 

since  the  Cj( 7)  (j  =  1, . . .  k)  go  faster  towards  zero  than  Cfc+i(7).  To  make  the  variable  Wk+l  active, 
define 


Ap+i  —  Ap  7  i 

Wp  =  Wp(Y), 


and  continue  with  k  +  1  components  as: 

3This  is  a  recasting  of  Lemma  4  in  Efron  et  al.  (2004). 


(A. 79) 
(A. 80) 
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Wp+i(7)  =  Wp  +  7a  = 


(wl\ 

+  7 

(  \ 

wk 

ak 

V  0  ) 

\ak+1J 

7  €  [0,  Ap+i] 


A  —  Ap+i  —  7 

To  find  a,  impose  on  (A. 81)  the  conditions  (A. 43)  that  give 

P  +  M(t+u+i)WP+i(7)  =  (Ap+i  —  7)s 

where  p  =  (pi, . . .  ,Pk+ i)T,  M(fc+l  fc+1)  is  the  (, k  +  1)  x  (k  +  1)  submatrix  of  My,  and 

(  sgn(W' )  \ 


(A.81) 
(A. 82) 
(A. 83) 


s  = 


sgn(Wk) 


(AM) 


\sgn(ak+1)J 

Since  (A. 83)  holds  for  any  7,  we  get  the  two  equations 

P  +  M(fc+ljfc)Wp  =  Ap+is 

and 

M, 


(AM) 

l(fc+i,fc+i)a  =  —s  •  (A. 86) 

where  M(fc+1+.)  is  obtained  from  M(fc+1)fe+1)  by  eliminating  the  last  column.  Inserting  (A. 85) 
into  (A. 86)  we  get 


a  —  y  +1  M(fc+ijfc+i)(p  +  M(fc+i!fe)Wp) 

(p  +  M(fe+i!fc)Wp(Ap)  —  M(fc+ljfe)Wp(Ap)  +  M(fc+1)fc)Wp)  (A. 


(A. 87) 


Ap+i 

1 


M 


-1 


0 


1 


(Wp-Wp(Ap)) 


(A. 89) 


-  Ap+1  (fc+i’fc+i)  \Cfe+1(Ap)y  Ap+1  v  p 
where  0  has  k  elements.  Since  the  (k  +  l)-th  element  of  the  second  term  in  (A. 89)  is  zero,  we  get 

(■^■(fe+i.fe+i)) 


ak+1  =  —  - 


(fc+i)(fc+i) 


Ap+i 
t— 1 


Ck+ 1  ( Ap) . 


(A. 90) 


Since  M^fc+1  k+1 ^  is  negative  definite,  we  have  (M(.fc+1  fe+1))(fc+i)(fe+i)  <  0,  so  using  (A. 78),  the  result 


sgn(ak+1)  =  sgn(ck+i(  7')) 


(A.91) 


follows.  □ 


B  The  Cp  criterion  for  low  SNR 

In  the  limit  of  very  low  signal-to-noise  ratio,  we  can  ignore  the  dynamic  noise  term  in  eq.  (2.1)  and 
consider 


Vt+at  =  AVt  +  WUt 

yt  =  BtVt+riu  %  ~  Af(0,  CyI). 


(B.l) 

(B.2) 
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Let  us  assume  that  the  number  of  presynaptic  neurons  is  K  =  1  to  simplify  the  formulas.  The  results 
can  be  easily  extended  to  the  general  case.  We  can  combine  the  above  equations  as 


Y  =  XW  +  ri , 


where  we  defined 


Y  = 


y  i 


\VT/ 


77  = 


WTJ 


and  the  matrix  X  is  given  by  the  product 

X  =  BC 

where  B  was  defined  in  (A. 25)  and 

/  0 


vSTxN 


c  = 


Ui 

AUi  +  U2 
A~U\  +  AU2  +  Uo { 


t>NTxN 


(B.3) 


(B.4) 


(B.5) 


(B.6) 


V  AT~2U 1  +  •  •  •  +  UT-1  ) 

Equation  (B.3)  corresponds  to  a  standard  linear  regression  problem  and  the  Zi-penalized  posterior 
log-likelihood  to  maximize  is  now 


1  N 

logp(W|y,  A)  =  -~\\Y  -  AW||2  -  A X]  1^1 . 


(B.7) 


The  solution  W{ A)  that  maximizes  (B.7)  is  obtained,  as  in  the  general  case,  using  the  LARS/LARS+ 
algorithm,  and  the  fitted  observations  are  given  by 

Y(X)  =  BCW{ A) .  (B.8) 

One  can  show  that  each  row  in  C'W'(A)  corresponds  to  the  Cy  — >  0  limit  of  the  expected  voltage  Vt(\) 
dehned  in  (2.12).  Given  an  experiment  (Y.  U ),  consider  the  training  error 

err(A)  =  \  \Y  —  Y  (A)  1 12  (B.9) 


and  the  in-sample  error 


Errin(A)  =  E  y 


||f-F(A)||2 


(B.10) 


In  Err;n(A),  we  compute  the  expectation  over  new  observations  Y  for  the  same  stimuli  Ut  and  compare 
them  to  the  predictions  Y( A)  obtained  with  the  initial  experiment  ( Y,U ).  Thus,  Errin(A)  gives  a 
measure  of  the  generalization  error  of  our  results.  Err;n(A)  itself  cannot  be  computed  directly,  but 
we  can  compute  its  expectation  with  respect  to  the  original  observations  Y.  For  this,  let  us  consider 
first  the  difference  between  Eri‘ira  and  err,  called  the  optimism  (Friedman  et  ah,  2008).  Denoting  the 
components  of  Y  with  an  index  i.  it  is  easy  to  verify  that  the  expected  optimism  with  respect  to  Y 
is 


w(A) 


(Errin(A)  -  err(A)) 

ST 


2^(yy(A))-(Fi)<Fi(A)) 
2—  1 
ST 


2^Cov(y,L;;(A)). 


(B.ll) 

(B.12) 

(B.13) 
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For  the  general  case  I\  >  1,  we  will  have  X  G  M.STxNK .  Let  us  assume  that  ST  >  NK  and  that 
X  is  full  rank,  that  is,  rank(X)  =  NK.  Then  in  (Zou  et  ah,  2007)  it  was  shown  that  if  we  define 
d{ A)  =  ||W'(A)||o  as  the  number  of  non-zero  components  in  W( A),  we  have4 

w(A)  =  2(d(A))  Cy.  (B.14) 

Thus  2d(A)  Cy  is  an  unbiased  estimate  of  w(A),  and  is  also  consistent  (Zou  et  al.,  2007).  With  this 
result,  and  using  err(A)  as  an  estimate  of  (err(A)),  we  obtain  an  estimate  of  the  average  generalization 
error  (Errj„(A))  as 

Cp( A)  =  \\Y  -  F(A)||2  +  2d(A)  Cy  .  (B.15) 

This  quantity  can  be  used  to  select  the  best  A  as  that  value  that  minimizes  Cp( A).  Since  the  first 
term  is  a  non-decreasing  function  of  A  (Zou  et  ah,  2007),  it  is  enough  to  evaluate  Cp( A)  for  each  d  at 
the  smallest  value  of  A  at  which  there  are  d  active  weights  in  W( A).  With  a  slight  abuse  of  notation, 
the  resulting  set  of  discrete  values  of  (B.15)  will  be  denoted  as  Cp{d). 


C  The  low-rank  block-Thomas  algorithm 


In  this  appendix  we  will  present  a  fast  approximation  technique  to  perform  multiplications  by  the 
inverse  Hessian  Hyy.  The  NT  x  NT  Hessian  Hyv  in  (A. 6)  takes  the  block-tridiagonal  form 


Hvv 


(-Cq1-AtA  At  0  ...  \ 

A  -I-  AT A  AT  0  ... 

0  A  -I  -  AT A  AT  0 


V 

(BjC~1B1 

B^C~1B2 

V 


\ 

Bj'Cy1BTj 


A 


-I) 


(C.l) 


where  we  have  set  Cy  =  I  to  simplify  the  notation.  We  will  restore  it  below  to  a  generic  value. 

It  will  be  convenient,  following  Paninski  (2010),  to  adopt  for  Co,  the  covariance  of  the  initial 
voltage  V\ ,  the  value 

OO 


Co  =  J2(aaTY  =  (J  -  aa7’)-1 


(C.2) 


»= o 

(note  that  the  dynamics  matrix  A  is  stable  here,  ensuring  the  convergence  of  this  infinite  sum).  This 
is  the  stationary  prior  covariance  of  the  voltages  V)  in  the  absence  of  observations  yt,  and  with  this 
value  for  Co,  the  top  left  entry  in  the  first  matrix  in  (C.l)  simplifies  to  —  Cq1  —  AT A  =  —  I. 

We  want  to  calculate 


(b\y 

/ah\ 

Hy'yb  =  Hy], 

^2 

= 

x2 

\bT/ 

\Xt) 

where  b  can  be  an  arbitrary  IVT-dimensional  vector  and  each  5,;  and  Xi  is  a  column  vector  with 
length  N.  We  can  calculate  this  using  the  block  Thomas  algorithm  for  tridiagonal  systems  of 
equations  (Press  et  al.,  1992),  which  in  general  requires  0(N3T)  time  and  0(N2T)  space,  as  shown 
in  Algorithm  1. 


4We  have  verified,  through  Monte  Carlo  simulations  similar  to  those  in  (Zou  et  al.,  2007),  that  this  result  also  holds  in 
the  positive  constrained  case. 


36 


Algorithm  1  Standard  Block  Thomas  Algorithm  for  calculating  HVyh 

Oil  < - (l  +  Bf  Cy1  Bi) 

71  <—  a^lAT 

Vi  a[lbi 

for  i  =  2  to  T  —  1  do 

OLi  i - (I  +  AT A  +  Bj Cy  1Bi  +  A^i-i) 

li  ^AT 

yi  -e-  ^  -  Ayj_i) 

end  for 

«r  * - (/  +  B^Cy1  Bt  +  -A7T-1) 

xt  o^1(6t  —  ^4yr-i) 

for  i  =  T  —  1  to  1  do 

f  Ui  'Yi%i+ 1 

end  for 


We  can  adapt  this  algorithm  to  yield  an  approximate  solution  to  (C.3)  in  0(TNS 2)  time  by  using 
low-rank  perturbation  techniques  similar  to  those  used  in  Paninski  (2010);  Huggins  and  Paninski 
(2012);  Pnevmatikakis,  Paninski,  Rad  and  Huggins  (2012).  The  first  task  is  to  calculate  a)"1.  Using 
the  Woodbury  matrix  lemma,  we  get 

ai1  =  —(I  +  Bi  Cy1  Bi)-1  (C.4) 

=  -I  +  BT(CV  +  B1B’[)~1B1  (C.5) 


where 


—  — I  +  L\DiL1  G 


L,  =  B 


Di  —  {Cv  +  BiB^)-1  £RSxS.  (C.8) 

Note  that  the  simple  expression  (C.6)  for  a^1  follows  from  the  form  we  chose  in  (C.2)  for  Cq. 
Plugging  ay  1  into  the  Algorithm  l’s  expression  for  71  gives 


7i  =  «i  1AT 

=  —AT  +  LiDiL^A7 


(C.9) 

(C.10) 


To  continue  the  recursion  for  the  other  at  xs,  the  idea  is  to  approximate  these  matrices  as  low-rank 
perturbations  to  — 

a-1  w  -I  +  LiDiLf  G  RNxN  ,  (C.ll) 

where  D,;  is  a  small  di  x  di  matrix  with  d,;  <C  N  and  L,;  G  RNxdi.  This  in  turn  leads  to  a  form  similar 
to  (C.10)  for  7 j, 

H~-AT  +  LiDiLf  AT.  (C.12) 

Therefore  we  can  write 


=  -{I  +  ATA  +  BjC-1Bl  +  All_1)~l 

~  — (/  +  A^  A  +  B{  Cy  1  Bi  —  AAA  +  ALi^iDi—iLf_iAr^^) 
»  —(I  +  Bf  C~1Bi  +  ALl_iDi_iLj_1AT)-1 . 
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(C.13) 

(C.14) 

(C.15) 


This  expression  justifies  our  approximation  of  a^s  as  a  low  rank  perturbation  to  —I:  the  term 
BfC~1Bi  is  low  rank  because  the  number  of  measurements  is  S'  <C  N,  and  the  second  term  is  low 
rank  because  the  condition  eigs(A)  <  1  tends  to  suppress  at  step  i  the  contribution  of  the  previous 

step  encoded  in  Li_iDi_iLj_1.  See  Pnevmatikakis,  Paninski,  Rad  and  Huggins  (2012)  for  details. 

To  apply  Woodbury  we  choose  a  basis  for  the  two  non-identity  matrices, 

Oi  =  [ALi_ i  Bj]  G  (C.16) 

and  write 

BjC-'Bi  +  ALi_1Di_1Lf_1AT  =  O.A'Wf ,  (C.17) 

where 

Mi  =  (^Di~l  G  ^(S+di-iMS+di-!) 

Applying  Woodbury  gives 

a”1  =  -(I  +  OiMiOf)-1  (C.18) 

=  -I  +  Ol(M~1+OfOl)-1Of.  (C.19) 

We  obtain  Li  and  Di  by  truncating  the  SVD  of  the  expression  on  the  right-hand  side:  in  Matlab, 
for  example,  do 


[!/,£>']  =  svdiOiiMr1  +  OfOi)-1' 2,^econ,  ),  (C.20) 

then  choose  Li  as  the  first  di  columns  of  L'  and  Di  as  the  square  of  the  first  di  diagonal  elements 
D' ,  where  di  is  chosen  to  be  large  enough  (for  accuracy)  and  small  enough  (for  computational 
tractability) . 

We  must  handle  a^1  slightly  differently  because  of  the  boundary  condition.  Making  use  of  the 
fact  that  Cq1  =  I  —  AAT  and  the  Woodbury  identity,  we  get 

a t1  =  —  (I  +  Bj:C~1Bt  +  A'jt-i)-1 

=  -( /  +  B^C^Bt  -  AAt  +  ALT-iDT-iL^_1At)-1 
=  -(Co1  +  OtMtO^)-1 

=  -C0  +  CoOt(M~1  +  O^C0Ot)-1O^Co 

=  — Co  +  LtDtL^, 

where 

Lt  =  CqOt  (C.26) 

and 

Dt  =  (. M -1  +  O^CoOt)-1.  (C.27) 

Multiplications  by  a^1  are  efficient  since  we  can  multiply  by  Co  in  O(N)  time,  expoiting  the 
sparse  structure  of  A  (see  (Paninski,  2010)  for  details).  It  is  unnecessary  to  control  the  rank  because 
we  will  only  be  performing  one  multiplication  with  a^1  and  calculating  the  SVD  is  a  relatively 
expensive  operation. 


(C.21) 

(C.22) 

(C.23) 

(C.24) 

(C.25) 
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The  updates  for  calculating  iji  and  Xi  are  straightforward: 


2/1  =  <*i  1h 

(C.28) 

=  —  &i  +  LiD\Lfbi 

(C.29) 

II 

P 

1 

O- 

1 

T* 

(C.30) 

=  (-1  +  LiDiLf  )(bi  -  AVi_ 0 

(C.31) 

xt  =  (Xrp1  (br  —  Cy1  AyT-i) 

(C.32) 

=  (—do  +  LtDtL f){br  —  Ayr- 1) 

(C.33) 

Xi  =  yi  -  jiXi+i 

(C.34) 

=  Hi  +  ATa;?:+i  —  LiDiLf  AT  Xi+\. 

(C.35) 

Algorithm  2  summarizes  the  full  procedure.  One  can  verify  that  the  total  computational  cost  scales 
like  0(TNS2)  (see  Pnevmatikakis,  Paninski,  Rad  and  Huggins  (2012)  for  details). 

Finally,  note  that  for  repeated  calls  to  Hyy b,  we  can  compute  the  matrices  Li,Di  once  and 
store  them.  For  the  case  when  Cy  is  not  the  identity  we  can  apply  a  linear  whitening  change  of 
variables  V{  =  C^f^Vt-  We  solve  as  above  except  we  make  the  substitution  Bt  — >  BtCy  2  and  our 
final  solution  now  has  the  form 

x  =  (lT®  Cl/2)  Hy/  (lT  0  C//2)  T  b. 


Algorithm  2  Low  Rank  Block  Thomas  Algorithm  for  calculating  Hyy b 
L\  <—  Bf 

D\  <-  ( Cy  +  BlB/)-1 

Vi  t - b\  +  L\DiL/bi 

for  i  =  2  to  T  do 
Ot  •<—  [AL-t- 1  Bf] 

Mi  -e-  diag(A-i,  C-V') 

if  i  f  T  then 

[L/DW  «-  svd(Oi(Mr 1  +  Of  Of-1/2,  ‘econ’  ) 

2/!t-(-/  +  L'D'Lf)(6i-%_1) 
control  rank  of  L(  and  D\  to  obtain  Lj  and  Di 
else 

L*  •<—  CoOj 

A  <-  (Mr1  +  Of  CoO,)-1 

end  if 
end  for 

xt  <—  (—Co  +  LtDtL^^t  —  AyT-i) 
for  i  =  T  —  1  to  1  do 

Xi  Ir-  m  +  ATXi+ 1  -  LiDiLf  ATXi+i 

end  for 


39 


Acknowledgements 

This  work  was  supported  by  an  NSF  CAREER  grant,  a  McKnight  Scholar  award,  and  by  NSF 
grant  IIS-0904353.  This  material  is  based  upon  work  supported  by,  or  in  part  by,  the  U.  S.  Army 
Research  Laboratory  and  the  U.  S.  Army  Research  Office  under  contract  number  W911NF-12-1-0594. 
JHH  was  partially  supported  by  the  Columbia  College  Rabi  Scholars  Program.  AP  was  partially 
supported  by  the  Swartz  Foundation.  The  computer  simulations  were  done  in  the  Hotfoot  HPC 
Cluster  of  Columbia  University.  We  thank  E.  Pnevmatikakis  for  helpful  discussions  and  comments. 


References 

Barbour,  B.,  Brunei,  N.,  Hakim,  V.  and  Nadal,  J.-P.  (2007),  ‘What  can  we  learn  from  synaptic 
weight  distributions?’,  TRENDS  in  Neurosciences  30(12),  622-629. 

Bloomfield,  S.  and  Miller,  R.  (1986),  ‘A  functional  organization  of  ON  and  OFF  pathways  in  the 
rabbit  retina’,  J.  Neurosci.  6(1),  1  13. 

Candes,  E.,  Romberg,  J.  and  Tao,  T.  (2006),  ‘Stable  signal  recovery  from  incomplete  and  inaccurate 
measurements’,  Communications  on  pure  and  applied  mathematics  59(8),  1207-1223. 

Candes,  E.  and  Wakin,  M.  (2008),  ‘An  introduction  to  compressive  sampling’,  Signal  Processing 
Magazine,  IEEE  25(2),  21  -30. 

Canepari,  M.,  Djurisic,  M.  and  Zecevic,  D.  (2007),  ‘Dendritic  signals  from  rat  hippocampal  CA1 
pyramidal  neurons  during  coincident  pre-  and  post-synaptic  activity:  a  combined  voltage-  and 
calcium-imaging  study’,  J  Physiol  580(2),  463-484. 

Canepari,  M.,  Vogt,  K.  and  Zecevic,  D.  (2008),  ‘Combining  voltage  and  calcium  imaging  from 
neuronal  dendrites’,  Cellular  and  Molecular  Neurobiology  28,  1079-1093. 

Djurisic,  M.,  Antic,  S.,  Chen,  W.  R.  and  Zecevic,  D.  (2004),  ‘Voltage  imaging  from  dendrites  of 
mitral  cells:  EPSP  attenuation  and  spike  trigger  zones’,  J.  Neurosci.  24(30),  6703-6714. 

Djurisic,  M.,  Popovic,  M.,  Carnevale,  N.  and  Zecevic,  D.  (2008),  ‘Functional  structure  of  the  mitral 
cell  dendritic  tuft  in  the  rat  olfactory  bulb’,  J.  Neurosci.  28(15),  4057-4068. 

Dombeck,  D.  A.,  Blanchard-Desce,  M.  and  Webb,  W.  W.  (2004),  ‘Optical  recording  of  action  poten¬ 
tials  with  second-harnronic  generation  microscopy’,  J.  Neurosci.  24(4),  999-1003. 

Durbin,  J.,  Koopman,  S.  and  Atkinson,  A.  (2001),  Time  series  analysis  by  state  space  methods , 
Vol.  15,  Oxford  University  Press  Oxford. 

Efron,  B.  (2004),  ‘The  estimation  of  prediction  error’,  Journal  of  the  American  Statistical  Association 
99(467),  619-632. 

Efron,  B.,  Hastie,  T.,  Johnstone,  I.  and  Tibshirani,  R.  (2004),  ‘Least  angle  regression’,  Annals  of 
Statistics  32,  407-499. 

Fisher,  J.  A.  N.,  Barchi,  J.  R.,  Welle,  C.  G.,  Kim,  G.-H.,  Kosterin,  P.,  Obaid,  A.  L.,  Yodh,  A.  G., 
Contreras,  D.  and  Salzberg,  B.  M.  (2008),  ‘Two-photon  excitation  of  potentiometric  probes  enables 
optical  recording  of  action  potentials  from  mammalian  nerve  terminals  in  situ’,  J  Neurophysiol 
99(3),  1545-1553. 

Friedman,  J.,  Hastie,  T.,  Hofling,  H.  and  Tibshirani,  R.  (2007),  ‘Pathwise  coordinate  optimization’, 
The  Annals  of  Applied  Statistics  1(2),  302-332. 


40 


Friedman,  J.,  Hastie,  T.  and  Tibshirani,  R.  (2008),  The  Elements  of  Statistical  Learning ,  Spinger. 

Gelman,  A.,  Carlin,  J.,  Stern,  H.  and  Rubin,  D.  (2004),  Bayesian  data  analysis ,  CRC  press. 

Geman,  S.,  Bienenstock,  E.  and  Doursat,  R.  (1992),  ‘Neural  networks  and  the  bias/variance 
dilemma’,  Neural  computation  4(1),  1-58. 

Gobel,  W.  and  Helmchen,  F.  (2007),  ‘New  angles  on  neuronal  dendrites  in  vivo’,  J  Neurophysiol 
98(6),  3770-3779. 

Hines,  M.  (1984),  ‘Efficient  computation  of  branched  nerve  equations’,  International  Journal  of  Bio- 
Medical  Computing  15(1),  69  76. 

Huber,  P.  (1964),  ‘Robust  estimation  of  a  location  parameter’,  The  Annals  of  Mathematical  Statistics 
35(1),  73-101. 

Huggins,  J.  and  Paninski,  L.  (2012),  ‘Optimal  experimental  design  for  sampling  voltage  on  dendritic 
trees’,  In  Press,  Journal  of  Computational  Neuroscience  . 

Huys,  Q.,  Ahrens,  M.  and  Paninski,  L.  (2006),  ‘Efficient  estimation  of  detailed  single-neuron  models’, 
Journal  of  Neurophysiology  96,  872-890. 

Huys,  Q.  and  Paninski,  L.  (2009),  ‘Model-based  smoothing  of,  and  parameter  estimation  from,  noisy 
biophysical  recordings’,  PLOS  Computational  Biology  5,  el000379. 

Iyer,  V.,  Hoogland,  T.  M.  and  Saggau,  P.  (2006),  ‘Fast  functional  imaging  of  single  neurons  using 
random-access  multiphoton  (RAMP)  microscopy’,  J  Neurophysiol  95(1),  535-545. 

Knopfel,  T.,  Diez-Garcia,  J.  and  Akemann,  W.  (2006),  ‘Optical  probing  of  neuronal  circuit  dynamics: 
genetically  encoded  versus  classical  fluorescent  sensors’,  Trends  in  Neurosciences  29,  160-166. 

Kralj,  J.,  Douglass,  A.,  Hochbaum,  D.,  Maclaurin,  D.  and  Cohen,  A.  (2011),  ‘Optical  recording  of 
action  potentials  in  mammalian  neurons  using  a  microbial  rhodopsin’,  Nature  Methods  . 

Larkum,  M.  E.,  Watanabe,  S.,  Lasser-Ross,  N.,  Rhodes,  P.  and  Ross,  W.  N.  (2008),  ‘Dendritic 
properties  of  turtle  pyramidal  neurons’,  J  Neurophysiol  99(2),  683-694. 

Lin,  Y.  and  Zhang,  H.  (2006),  ‘Component  selection  and  smoothing  in  multivariate  nonparametric 
regression’,  The  Annals  of  Statistics  34(5),  2272-2297. 

Mallows,  C.  (1973),  ‘Some  comments  on  Cp’,  Technometrics  pp.  661-675. 

Milojkovic,  B.  A.,  Zhou,  W.-L.  and  Antic,  S.  D.  (2007),  ‘Voltage  and  calcium  transients  in  basal 
dendrites  of  the  rat  prefrontal  cortex’,  J  Physiol  585(2),  447-468. 

Mishchenko,  Y.  and  Paninski,  L.  (2012),  ‘A  Bayesian  compressed-sensing  approach  for  reconstructing 
neural  connectivity  from  subsampled  anatomical  data’,  Under  review  . 

Mishchenko,  Y.,  Vogelstein,  J.  and  Paninski,  L.  (2011),  ‘A  Bayesian  approach  for  inferring  neuronal 
connectivity  from  calcium  fluorescent  imaging  data’,  Annals  of  Applied  Statistics  5,  1229-1261. 

Mitchell,  T.  J.  and  Beauchamp,  J.  J.  (1988),  ‘Bayesian  variable  selection  in  linear  regression’,  Journal 
of  the  American  Statistical  Association  83(404),  1023-1032. 

Nikolenko,  V.,  Watson,  B.,  Araya,  R.,  Woodruff,  A.,  Peterka,  D.  and  Yuste,  R.  (2008),  ‘SLM  mi¬ 
croscopy:  Scanless  two-photon  imaging  and  photostimulation  using  spatial  light  modulators’, 
Frontiers  in  Neural  Circuits  2,  5. 


41 


Nuriya,  M.,  Jiang,  J.,  Nemet,  B.,  Eisenthal,  K.  and  Yuste,  R.  (2006),  ‘Imaging  membrane  potential 
in  dendritic  spines’,  PNAS  103,  786-790. 

Packer,  A.  M.,  Peterka,  D.  S.,  Hirtz,  J.  J.,  Prakash,  R.,  Deisseroth,  K.  and  Yuste,  R.  (2012),  ‘Two- 
photon  optogenetics  of  dendritic  spines  and  neural  circuits’,  Nature  methods  9(12),  1202-1205. 

Pakman,  A.  and  Paninski,  L.  (2013),  ‘Exact  lramiltonian  monte  carlo  for  truncated  multivariate 
gaussians’,  Journal  of  Computational  and  Graphical  Statistics,  preprint  arXiv:1208.4118  ■ 

Paninski,  L.  (2010),  ‘Fast  Kalman  filtering  on  quasilinear  dendritic  trees’,  Journal  of  Computational 
Neuroscience  28,  211-28. 

Paninski,  L.  and  Ferreira,  D.  (2008),  ‘State-space  methods  for  inferring  synaptic  inputs  and  weights’, 
CO SYNE . 

Paninski,  L.,  Vidne,  M.,  DePasquale,  B.  and  Ferreira,  D.  (2012),  ‘Inferring  synaptic  inputs  given 
a  noisy  voltage  trace  via  sequential  monte  carlo  methods’,  In  Press,  Journal  of  Computational 
Neuroscience  . 

Peterka,  D.,  Takahashi,  H.  and  Yuste,  R.  (2011),  ‘Imaging  voltage  in  neurons’,  Neuron  69(1),  9-21. 

Pnevmatikakis,  E.  A.,  Kellelier,  K.,  Chen,  R.,  Saggau,  P.,  Josic,  K.  and  Paninski,  L.  (2012),  Fast 
spatiotemporal  smoothing  of  calcium  measurements  in  dendritic  trees,  submitted. 

Pnevmatikakis,  E.  A.  and  Paninski,  L.  (2012),  ‘Fast  interior-point  inference  in  high-dimensional 
sparse,  penalized  state-space  models’,  Proceedings  of  the  15th  International  Conference  on  Ar¬ 
tificial  Intelligence  and  Statistics  (AISTATS)  2012,  La  Palma,  Canary  Islands.  Volume  XX  of 
JMLR:  W&CP  XX.  . 

Pnevmatikakis,  E.  A.,  Paninski,  L.,  Rad,  K.  R.  and  Huggins,  J.  (2012),  ‘Fast  Kalman  filtering  and 
forward-backward  smoothing  via  a  low-rank  perturbative  approach’,  Journal  of  Computational 
and  Graphical  Statistics  .  in  press. 

Press,  W.,  Teukolsky,  S.,  Vetterling,  W.  and  Flannery,  B.  (1992),  Numerical  recipes  in  C ,  Cambridge 
University  Press. 

Reddy,  G.  D.  and  Saggau,  P.  (2005),  ‘Fast  three-dimensional  laser  scanning  scheme  using  acousto¬ 
optic  deflectors’,  J  Biomed  Opt  10(6),  064038. 

Sacconi,  L.,  Dombeck,  D.  A.  and  Webb,  W.  W.  (2006),  ‘Overcoming  photodamage  in  second- 
harmonic  generation  microscopy:  Real-time  optical  recording  of  neuronal  action  potentials’,  Pro¬ 
ceedings  of  the  National  Academy  of  Sciences  103(9),  3124-3129. 

Sjostrom,  P.  J.,  Rancz,  E.  A.,  Roth,  A.  and  Hausser,  M.  (2008),  ‘Dendritic  Excitability  and  Synaptic 
Plasticity’,  Physiol.  Rev.  88(2),  769-840. 

Smith,  C.  (2013),  ‘Low-rank  graphical  models  and  Bayesian  analysis  of  neural  data’,  PhD  Thesis, 
Columbia  University  . 

Song,  S.,  Sjostrom,  P.  J.,  Reigl,  M.,  Nelson,  S.  and  Chklovskii,  D.  B.  (2005),  ‘Highly  nonrandom 
features  of  synaptic  connectivity  in  local  cortical  circuits’,  PLoS  biology  3(3),  e68. 

Studer,  V.,  Bobin,  J.,  Chahid,  M.,  Mousavi,  H.,  Candes,  E.  and  Dahan,  M.  (2012),  ‘Compressive 
fluorescence  microscopy  for  biological  and  hyperspectral  imaging’,  Proceedings  of  the  National 
Academy  of  Sciences  109(26),  E1679-E1687. 


42 


Takaliaslii,  N.,  Kitamura,  K.,  Matsuo,  N.,  Mayford,  M.,  Kano,  M.,  Matsuki,  N.  and  Ikegaya,  Y. 
(2012),  ‘Locally  synchronized  synaptic  inputs’,  Science  335(6066),  353-356. 

Tibshirani,  R.  (1996),  ‘Regression  shrinkage  and  selection  via  the  lasso’,  Journal  of  the  Royal  Sta¬ 
tistical  Society.  Series  B  58,  267-288. 

Vucinic,  D.  and  Sejnowski,  T.  J.  (2007),  ‘A  compact  multiphoton  3d  imaging  system  for  recording 
fast  neuronal  activity’,  PLoS  ONE  2(8),  e699. 

Yuan,  M.  and  Lin,  Y.  (2006),  ‘Model  selection  and  estimation  in  regression  with  grouped  variables’, 
Journal  of  the  Royal  Statistical  Society:  Series  B  (Statistical  Methodology)  68(1),  49-67. 

Zou,  H.  and  Hastie,  T.  (2005),  ‘Regularization  and  variable  selection  via  the  elastic  net’,  Journal  of 
the  Royal  Statistical  Society:  Series  B  (Statistical  Methodology)  67(2),  301-320. 

Zou,  H.,  Hastie,  T.  and  Tibshirani,  R.  (2007),  ‘On  the  “degrees  of  freedom”  of  the  lasso’,  The  Annals 
of  Statistics  35(5),  2173-2192. 


43 


